Abstract

In real-life statistical data, it seems that conditional probabilities
for effects given their causes tend to be less complex and smoother
than conditionals for causes given their effects. We have recently
proposed and tested methods for causal inference in machine learning
using a formalization of this principle.

Here we try to provide some theoretical justification for causal
inference methods based upon such a "causally asymmetric"
interpretation of Occam's Razor. To this end, we discuss toy models of
cause-effect relations from classical and quantum physics as well as
computer science in the context of various aspects of complexity.

We argue that this asymmetry of the statistical dependences
between cause and effect has a thermodynamic origin. The essential link
is the tendency of the environment to provide independent background
noise realized by physical systems that are initially uncorrelated with
the system under consideration rather than being finally uncorrelated.
This link extends ideas from the literature relating Reichenbach's
principle of the common cause to the second law.

* e-mail: dominik.janzing@tuebingen.mpg.de

1 Causal reasoning from statistical data

Uncovering non-deterministic causal relations between observed quantities 
relies on the evaluation of statistical dependences and correlations in empiri- 
cal data. Two types of statistical data have to be carefully distinguished: in 
so-called experimental data, one observes the change of the distribution of one 
variable after interventions that control the value of the other. More often, 
one has to evaluate non-experimental data, where no controlling intervention
by the researcher is possible and causal conclusions have to be drawn merely
from observed dependences in the statistics. Causal reasoning that relies on
non-experimental data is likely to lead to seriously wrong conclusions. The main
obstacle is that statistical dependences between two random variables X and
Y can be due to three types of (non-exclusive) causal relations. First, X may
be a cause of Y, second, Y may be a cause of X, or third, there may be a
(latent) common cause, i.e., a hidden variable Z affecting X and Y. This is
usually referred to as the "principle of the common cause" [1].

If the variables X and Y are time-ordered and X refers to observations
that precede the observation of Y, it is still hard to decide whether X
affects Y or there is a hidden common cause ("confounder") Z. However, it is
known that the joint distribution of at least 3 variables provides some hints
on causal directions [1, 2, 3] via conditional independences among variables.
For instance, if the stochastic dependence between X and Y is only
generated by some common cause Z (see Fig. 1, left), the variables X and Y must
be independent with respect to the conditional probability given Z. On the
other hand, if Z is a common effect of X and Y (see Fig. 1, right), the role
of unconditional probabilities and conditional probabilities is reversed: the
conditional probability given Z would then, in the generic case, generate
dependences between the (actually independent) variables X and Y. Common
effects cannot be accepted as an explanation for (unconditional) dependences
but common causes can. Reichenbach [1] already argued that this statistical
asymmetry with respect to reversing causal arrows is linked to the
thermodynamic arrow of time. Before we describe another asymmetry between cause
and effect that we [4] have observed to be useful in causal reasoning and
discuss its relation to statistical physics, we first sketch the known approaches
to causal inference from empirical data.

Figure 1: Causal fork (left) and a collider (right), the simplest example of the
statistical asymmetry between cause and effect (see text).
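The asymmetry between a causal fork and a collider can be checked numerically. The following is only a minimal sketch on synthetic data: the binary Z, the Gaussian noise levels and the threshold mechanism for the collider are assumptions made for illustration and are not taken from the paper.

```python
import random

def correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
n = 20000

# Fork: Z -> X and Z -> Y (hypothetical binary Z with Gaussian noise).
fork = []
for _ in range(n):
    z = rng.choice([0, 1])
    fork.append((z, z + rng.gauss(0, 0.5), z + rng.gauss(0, 0.5)))

# Collider: X -> Z <- Y with independent X and Y (hypothetical threshold).
collider = []
for _ in range(n):
    x, y = rng.gauss(0, 1), rng.gauss(0, 1)
    collider.append((1 if x + y > 0 else 0, x, y))

fork_all = correlation([(x, y) for _, x, y in fork])            # dependent
fork_z0 = correlation([(x, y) for z, x, y in fork if z == 0])   # vanishes given Z
coll_all = correlation([(x, y) for _, x, y in collider])        # independent
coll_z1 = correlation([(x, y) for z, x, y in collider if z == 1])  # induced by Z
```

In the fork, conditioning on Z removes the dependence between X and Y; in the collider, conditioning on Z creates one.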

Following [3, 2] we restrict our attention to causal structures without
feedback loops and describe a causal structure as a directed acyclic graph
(DAG) with random variables $X_1, \dots, X_n$ as nodes. An arrow from $X_i$ to $X_j$
indicates that $X_i$ directly influences $X_j$. Even though a definition of cause
and effect would require deep philosophical discussions [3, 5, 6], we will
define causal relations by referring to hypothetical interventions. The variable
X influences Y whenever adjusting X (by external control) to some different
value x changes the distribution of Y (throughout the paper, we will
capitalize random variables and denote their values by lowercase letters). The
influence of $X_i$ on $X_j$ is direct (relative to the set $X_1, \dots, X_n$) whenever
the change of the distribution of $X_j$ caused by different adjustments of $X_i$
also occurs when all the other variables are fixed by an additional
intervention. This definition makes clear that causal inference from non-experimental
data infers probability distributions of hypothetical experimental data. The
connection between the statistics of non-experimental observations and the
causal graph (encoding information about the effect of hypothetical
interventions) is provided by the causal Markov condition.

Definition 1 (causal Markov condition) 

Let G be a DAG with n random variables $X_1, \dots, X_n$ as nodes. A joint
distribution P on these variables satisfies the (local) Markov condition with
respect to G if each variable is, given its parents, conditionally independent
of its non-descendants.

Throughout the paper we assume that $P(X_1, \dots, X_n)$ has a probability
density $p(x_1, \dots, x_n)$ with respect to some product distribution (note that this
assumption does not exclude discrete variables since probability mass
functions of discrete distributions are also densities). Then $p(x_1, \dots, x_n)$ can be
factorized into conditional probabilities for each variable, given its parents
[7]:

$$p(x_1, \dots, x_n) = \prod_{j=1}^{n} p(x_j \mid pa_j), \tag{1}$$

where $pa_j$ is a short notation for the subset of values $x_1, \dots, x_n$ that
correspond to the parents $PA_j$ of $X_j$ with respect to G. The conditional densities
$p(x_j \mid pa_j)$ will be called the "Markov kernels" corresponding to the causal
hypothesis G. Conversely, every choice of Markov kernels $p(x_j \mid pa_j)$ leads to
a Markovian distribution.

Following [3] we accept a causal hypothesis only if the observed statistical 
dependences are consistent with the Markov condition and mention also that 
this can be justified by so-called functional models: 

Definition 2 (functional model of causality) 

For each node $X_j$ we introduce an additional noise variable $S_j$ and assume
that the actual value $x_j$ of $X_j$ is a deterministic function of $S_j$ and all parents
of $X_j$. All the $S_j$ are jointly statistically independent.

Then the Markov condition follows due to Theorem 1.4.1 in [3]. It should
be noted that the noise variable $S_j$ is, by construction, statistically independent
of the ancestors of $X_j$, but in the generic case it is dependent on the
descendants of $X_j$. The relation of this asymmetry between cause and effect
to the second law of thermodynamics will be discussed in Section 5.

There are at least n! causal graphs for which all distributions P are
Markovian, namely every complete DAG (that is, a graph where each $X_i$ has an
arrow to $X_j$ for every j > i if some arbitrary order of nodes is given). One
therefore needs additional inference rules. So-called independence-based
approaches to causal inference [2, 3] are based upon the faithfulness
assumption:

Definition 3 (causal faithfulness condition)

A joint probability distribution P on n random variables $X_1, \dots, X_n$ is
faithful with respect to G if only those conditional independences are true which
are implied by the Markov condition.

The idea is the following: Given G and an independent choice of values for
the free parameters $p(x_j \mid pa_j)$, it is unlikely to obtain a distribution that
is not faithful to G. It is more natural to assume that an independence relation
holds because it is entailed by the causal structure than that it is due to
specific adjustments of the parameters $p(x_j \mid pa_j)$. Arguments of this kind
are justified by referring to "Occam's Razor" [3]. Bayesian methods for causal
discovery [8] are also known to give an implicit preference to faithful
structures provided that the priors are positive densities on the space of all
$p(x_j \mid pa_j)$ [9].

Unfortunately, faithfulness rarely leads to a unique causal graph. More
often there are still several possible causal hypotheses. Therefore, additional
inference rules are desirable. In seeking new methods one must be aware
that no inference principle can always lead to correct results, since
there is in principle no method to infer causal relations from non-experimental
data that is always reliable. This is because one can construct a technical
system with causal structure G that generates any desired joint distribution
P that factorizes according to eq. (1). To this end, let each node j be a
random generator whose inputs are provided by the parents of j and whose
output is sampled according to $p(x_j \mid pa_j)$. Then the joint output $x_1, \dots, x_n$
is obviously sampled from $p(x_1, \dots, x_n)$.
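The construction above (every node is a random generator fed by its parents) can be sketched in a few lines. The three-node DAG, the binary variables and the kernel values below are hypothetical choices for illustration only.

```python
import random

# Hypothetical DAG on binary variables: X1 -> X2, X1 -> X3, X2 -> X3.
parents = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"]}

# Markov kernels: probability that a node equals 1, given its parents' values.
kernels = {
    "X1": {(): 0.3},
    "X2": {(0,): 0.2, (1,): 0.7},
    "X3": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9},
}
order = ["X1", "X2", "X3"]  # any order in which parents precede children

def sample_joint(rng):
    """Let every node act as a random generator driven by its parents."""
    values = {}
    for node in order:
        pa = tuple(values[p] for p in parents[node])
        values[node] = 1 if rng.random() < kernels[node][pa] else 0
    return values

def joint_probability(values):
    """Product of the Markov kernels, as in eq. (1)."""
    prob = 1.0
    for node in order:
        pa = tuple(values[p] for p in parents[node])
        p_one = kernels[node][pa]
        prob *= p_one if values[node] == 1 else 1.0 - p_one
    return prob

rng = random.Random(1)
n = 50000
target = {"X1": 1, "X2": 1, "X3": 1}
empirical = sum(1 for _ in range(n) if sample_joint(rng) == target) / n
exact = joint_probability(target)
```

The relative frequency `empirical` approaches `exact`, the value of the factorized density, which is why no purely observational method can rule out the graph that was used to generate the data.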

Recent proposals for alternative causal inference methods are based on
the observation that in many cases the $p(x_j \mid pa_j)$ are quite complex functions for
some causal directions and simple for others [10, 4, 11, 12]. The idea
is that the causal hypotheses for which the Markov kernels are simpler are
more likely to be the true ones. We have proposed [13] to use this approach
for post-selection of causal hypotheses after independence-based algorithms
have already reduced the set of potential causal graphs. However, the case of
two variables X and Y where the task is to distinguish between $X \to Y$ and
$Y \to X$ is particularly interesting because independence-based approaches
fail completely. We will therefore devote our main attention to this case.

The idea that models in forward (time and causal) direction tend to
be simpler than in backward direction is certainly not new. The underlying
intuition has influenced human and automated reasoning for a long
time. Psychological studies indicate that human intuition is better at
estimating the strength of causal links (which is encoded in causal conditionals
P(effect | causes)) than at inferring non-causal conditionals [14]. For this
reason, it is natural that simplicity principles ("Occam's Razor") are
automatically interpreted as simplicity of a model when described in the
correct causal direction. However, the author is not aware of any systematic
exploration of the theoretical background of causally asymmetric
interpretations of Occam's Razor from a statistical physics point of view.

The main ideas of this paper can be summarized as follows:

(1) The paper describes various simple models from quantum and classical
physics where the factorization of the joint distribution P(cause, effect) into
P(cause)P(effect | cause) yields "simpler" terms than the "non-causal"
factorization into P(effect)P(cause | effect). We will discuss different notions of
simplicity for which this is likely to be the case.

(2) For these models, we explain the simplicity of causal and forward-time
conditionals by the facts that (a) the dynamical laws of motion and the
Hamiltonians are simple, and (b) the relevant systems start in statistically
independent states rather than ending up in independent states. This point of
view shows a link between the suggested asymmetry between cause and effect
and the arrow of time in thermodynamically irreversible processes.

While interactions between physical systems $S_1$ and $S_2$ typically lead to
mutual influence, we can nevertheless obtain well-defined causal directions
in three ways:

The first approach is to let the variable X refer to the state of $S_1$ at some
time t and Y to the state of $S_2$ at some later time t' > t, or to let X and Y
refer to different time instants of the same system.

The second approach is to turn the interaction between $S_1$ and $S_2$ on
only after the state of $S_1$ is adjusted to its present state in order to avoid
backaction from $S_2$ to $S_1$.

The third approach is to choose physical conditions such that the influence
of $S_2$ on $S_1$ is negligible. We will, for instance, discuss non-equilibrium steady
states with temperature gradient where this is the case.

The paper is organized as follows. In Section 2 we sketch the inference
rule proposed in [4] and our approach to define smoothness of probability
distributions by constrained maximization of conditional entropy. Section 3
describes physical experiments that are consistent with our inference rule.
We discuss how the examples would have to be modified in order to obtain simple
Markov kernels for the non-causal conditionals P(cause | effect). In Section 4
we discuss examples showing that causal conditionals are also simpler than
non-causal ones with respect to other notions of simplicity, for instance, with
respect to computational complexity. We consider the computational
complexity of conditional probabilities connecting input and output of a boolean
circuit with additional noise and argue that P(effect | cause) can efficiently be
computed but P(cause | effect) cannot, provided that the inputs are
independent. We describe how this asymmetry is linked to the thermodynamics of
computation.

Section 5 connects the asymmetry between cause and effect in the shapes 
of conditionals to the asymmetry stated by the causal Markov condition. 
Section 6 describes results that link the observed asymmetries to the ther- 
modynamics of non-equilibrium steady states. 

2 The principle of plausible Markov kernels 
and its motivation 

To explain our inference principle we consider complete DAGs. They are
given by an arbitrary ordering of the n variables and drawing an arrow from
each variable to every other that appears later in the order. Then the causal
hypotheses are uniquely characterized by one out of n! possible orderings
of the nodes ("causal ordering"). This is no loss of generality since the true
graph can be obtained by removing statistically irrelevant parents, given that
it is a subgraph of the hypothetical complete graph. The Markov kernels
corresponding to a hypothetical causal order $X_1, \dots, X_n$ are defined as the
conditional probabilities $p(x_j \mid x_1, x_2, \dots, x_{j-1})$.

The starting point of our discussion will be the following vague formulation
of our inference rule.

Definition 4 (plausible Markov kernels method, abstract version) 

Prefer the hypothetical causal order $X_1, \dots, X_n$ for which the corresponding
Markov kernels $p(x_j \mid x_1, \dots, x_{j-1})$ are as simple and smooth as possible.

How to define smoothness and simplicity in a reasonable way is, however,
a difficult problem. As a first attempt, which provided some encouraging
results, we have chosen the following definition [4].

Definition 5 (second order Markov kernels)

The simplest non-trivial conditionals $p(x_j \mid x_1, \dots, x_{j-1})$ are those that
maximize the conditional Shannon entropy of $X_j$ given $X_1, \dots, X_{j-1}$ subject to
the given expectations $E(X_j) = c_j$ and second moments $E(X_j X_i) = d_{ij}$ for
$i = 1, \dots, j$, where $c_j$ and $d_{ij}$ denote the ensemble averages of the
corresponding quantities.

The conditional Shannon entropy of $X_j$ given $X_1, \dots, X_{j-1}$ is defined by

$$S(X_j \mid X_1, \dots, X_{j-1}) := -\int p(x_1, \dots, x_j) \ln p(x_j \mid x_1, \dots, x_{j-1}) \, dx_1 \cdots dx_j,$$

where the integral has to be read as a sum for the case of discrete variables.

To include vector-valued variables $X_j$ with components $X_j^\ell$ with
$\ell = 1, \dots, m_j$, one maximizes entropy subject to the constraints given by

$$E(X_j^\ell) = c_j^\ell \quad \text{and} \quad E(X_i^\ell X_j^k) = d_{ij}^{\ell k}$$

for $\ell = 1, \dots, m_i$, $k = 1, \dots, m_j$.

The term "second order Markov kernel" is justified by the following known
fact:

Theorem 1 (second order Markov kernels, explicit form)

The conditionals given by Definition 5 read

$$p(x_j \mid x_1, \dots, x_{j-1}) = \frac{1}{z(x_1, \dots, x_{j-1})} \exp\Big( \sum_{i \le j} a_{ij} x_i x_j + b x_j \Big), \tag{2}$$

with appropriate constants $a_{ij}$, $b$ and the partition function $z(x_1, \dots, x_{j-1})$.

Proof: We describe the proof for the continuous case, because the discrete
one is even more straightforward. Let us first assume that the value set
of $X_j$ is restricted to the interval $[-A, A]$. Then we can define a uniform
distribution $U(X_j \mid X_1, \dots, X_{j-1})$ with density $u(x_j \mid x_1, \dots, x_{j-1}) := 1/(2A)$.
Maximizing Shannon entropy is then equivalent to minimizing the
Kullback-Leibler distance

$$\int p(x_1, \dots, x_j) \ln \frac{p(x_j \mid x_1, \dots, x_{j-1})}{u(x_j \mid x_1, \dots, x_{j-1})} \, dx_1 \cdots dx_j$$

subject to the same constraints. Using Theorem 2.2 in [15], the solution is
given by eq. (2). With $A \to \infty$ we obtain the same solution without restricting
$X_j$ to a compact interval. □

Up to the partition function, the conditionals in eq. (2) are given by sec- 
ond order polynomials. Since first order polynomials cannot describe statis- 
tical dependences between variables, we have indeed the simplest non-trivial 
class of conditionals in the hierarchy [16] when we define models of kth order 
as those containing polynomials up to degree k. 

Our inference rule reads: 

Definition 6 (causal inference via second order Markov kernels) 

Estimate the first and second moments $E(X_i)$ and $E(X_i X_j)$ from the data set
using the ensemble averages. For all hypothetical causal orders $X_1, \dots, X_n$
compute the second order Markov kernels $p(x_j \mid x_1, \dots, x_{j-1})$ in the sense of
Definition 5 by maximizing conditional entropies subject to these moments.
Decide by appropriate statistical tests for which ordering the obtained joint
density $p(x_1, \dots, x_n)$ provides the best fit to the observed data.
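For one binary variable X and one real-valued variable Y, the rule can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the plain log-likelihood comparison stands in for the "appropriate statistical tests", and all function names are hypothetical. For the order (X, Y), eq. (2) gives relative frequencies for X and Gaussians with a common variance for Y given X; for the order (Y, X) it gives a Gaussian for Y and a logistic function of y for X given Y.

```python
import math
import random

def gauss_logpdf(y, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (y - mean) ** 2 / (2 * var)

def sigmoid(t):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def loglik_x_to_y(data):
    """Order (X, Y): frequencies for X, equal-variance Gaussians for Y given X."""
    n = len(data)
    groups = {v: [y for x, y in data if x == v] for v in (0, 1)}
    means = {v: sum(ys) / len(ys) for v, ys in groups.items()}
    var = sum((y - means[x]) ** 2 for x, y in data) / n  # pooled variance
    return sum(math.log(len(groups[x]) / n) + gauss_logpdf(y, means[x], var)
               for x, y in data)

def loglik_y_to_x(data, steps=300, rate=0.5):
    """Order (Y, X): Gaussian marginal for Y, logistic conditional for X given Y."""
    n = len(data)
    mean = sum(y for _, y in data) / n
    var = sum((y - mean) ** 2 for _, y in data) / n
    a = b = 0.0
    for _ in range(steps):  # gradient ascent on the conditional log-likelihood
        grad_a = grad_b = 0.0
        for x, y in data:
            resid = x - sigmoid(a * y + b)
            grad_a += resid * y
            grad_b += resid
        a += rate * grad_a / n
        b += rate * grad_b / n
    total = 0.0
    for x, y in data:
        t = a * y + b
        total += gauss_logpdf(y, mean, var) + math.log(sigmoid(t if x == 1 else -t))
    return total

# Synthetic data in which X influences Y (hypothetical means -2 and +2).
rng = random.Random(0)
data = []
for _ in range(1000):
    x = rng.choice([0, 1])
    data.append((x, rng.gauss(2.0 if x == 1 else -2.0, 1.0)))

ll_forward = loglik_x_to_y(data)
ll_backward = loglik_y_to_x(data)
preferred = "X -> Y" if ll_forward > ll_backward else "Y -> X"
```

On these data the bimodal marginal of Y is fitted poorly by a single Gaussian, so the second order model for the order (X, Y) fits better.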

This approach should only be considered as a preliminary attempt to 
formalize simplicity. Instead of only describing the simplest conditionals as 
above we have also proposed [17] a method to quantify the complexity of con- 
ditional densities. Then causal inference is done by preferring the direction 
that minimizes the sum of the complexities of all Markov kernels. However, 
we will focus on the first approach. 

We describe two instances where our principle is very intuitive. First we
consider two random variables X and Y where X is binary, i.e., its value set
is {0, 1}, and the value set of Y is $\mathbb{R}$. Assume that X influences Y. The best
second order model for p(x) is just the distribution given by the observed
relative frequencies. Using eq. (2) we obtain

$$p(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{a y^2 + b_x y + c_x} \tag{3}$$

with appropriate $a$, $b_x$, $c_x$. For every x, $p(y \mid x)$ is a Gaussian distribution where
x determines the mean [4]. Indeed, after having observed that the marginal
distribution of Y is a mixture of two Gaussians and that the conditionals
$p(y \mid x)$ are simple Gaussians, it seems very plausible to assume that X affects
Y and not vice versa.
Now we consider the reverse situation where Y influences X. The second
order model for p(y) is a Gaussian distribution. For $p(x \mid y)$ we
obtain [4]

$$p(x = 1 \mid y) = \frac{1}{2} \big( 1 + \tanh(a y + b) \big), \tag{4}$$

with appropriate $a, b \in \mathbb{R}$. For this example, one checks easily that only the
trivial case $X \perp\!\!\!\perp Y$ can have a second order model in both directions.
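The last claim can be made plausible numerically. In the sketch below (all parameter values are hypothetical), a second order model for the direction from X to Y is inverted by Bayes' rule. The resulting p(x | y) is a logistic function of y, written here as 1/2 (1 + tanh(ay + b)), but the marginal of Y is a two-component mixture and therefore not Gaussian unless the two means coincide.

```python
import math

def gauss(y, mean, sigma):
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical second order model for X -> Y: binary X, equal-variance Gaussians.
p1 = 0.4             # P(X = 1)
m0, m1 = -1.0, 1.5   # conditional means of Y given X = 0 and X = 1
sigma = 1.0

def marginal_y(y):
    return p1 * gauss(y, m1, sigma) + (1 - p1) * gauss(y, m0, sigma)

def posterior_x1(y):
    """P(X = 1 | y) obtained from the causal model by Bayes' rule."""
    return p1 * gauss(y, m1, sigma) / marginal_y(y)

# Constants of the logistic form, read off from the log-odds of the model.
a = (m1 - m0) / (2 * sigma ** 2)
b = 0.5 * math.log(p1 / (1 - p1)) - (m1 ** 2 - m0 ** 2) / (4 * sigma ** 2)

grid = [k / 10 for k in range(-50, 51)]
max_gap = max(abs(posterior_x1(y) - 0.5 * (1 + math.tanh(a * y + b))) for y in grid)

# Gaussian with the same mean and variance as the mixture marginal of Y.
mean_y = p1 * m1 + (1 - p1) * m0
var_y = sigma ** 2 + p1 * (1 - p1) * (m1 - m0) ** 2
marginal_gap = abs(marginal_y(mean_y) - gauss(mean_y, mean_y, math.sqrt(var_y)))
```

Here `max_gap` is zero up to rounding, while `marginal_gap` is clearly positive: the conditional for X given Y is of second order in both readings, and it is the marginal of Y that distinguishes the two directions.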

In Section 3 we will analyze examples from physics with one binary and 
one continuous variable that are consistent with the above second order 
Markov kernels. We have decided to choose examples from quantum me- 
chanics for two reasons. First, the quantum world provides us with natural 
realizations of binary variables. Second, the simplicity of the models under 
consideration is intriguing. Nevertheless, quantum superpositions are not 
relevant for the arguments in the next subsections. 

3 Second order Markov kernels in physical 
models 

3.1 Stern-Gerlach experiment 

Consider first an experiment like the one designed by Stern and Gerlach
in 1922 [18] to prove the quantization of the magnetic moment. A beam
of atoms is emitted from a furnace and enters an inhomogeneous magnetic
field perpendicular to the beam (see Fig. 2; here the field is in vertical
direction¹). The field induces a force in the direction of its gradient which
is proportional to the magnetic moment of the particles. For spin-1/2
particles, for instance, the magnetic moment can attain the values +1/2, -1/2,
causing forces in opposite vertical directions. This effect can be used as a
measurement apparatus for the quantum observable magnetic moment since
it separates the beam into two parts that hit the screen at different
vertical positions. We consider the values ±1/2 as the two values of a binary
variable X. Even though quantum mechanical observables are in general
not random variables on a probability space, this is well-justified because
the quantum superposition already becomes incoherent (by creating
entanglement with the position degree of freedom) when the beam begins to split
up. We define furthermore a random variable Y for the vertical coordinate
of the point where the atom hits the screen. It is natural to assume that
the conditional probabilities $p(y \mid x = \pm 1/2)$ are both not too different from
normal distributions. The following extremely simplified model, for instance,
yields Gaussian conditionals. Before the particles leave the source they
are subjected to some focusing forces. For simplicity we restrict our attention
to the focus in vertical direction and assume that the forces are induced by a
harmonic potential in vertical direction. In thermal equilibrium, the
probability distribution of momenta in a classical as well as in a quantum harmonic
oscillator is Gaussian (see Section 3.3 and [19], respectively). Assuming that
the probability distribution of the particle momenta is still Gaussian when
they leave the source, we obtain for both spin values Gaussian distributions for
Y with different expected values.

Figure 2: Stern-Gerlach experiment. The atom beam emitted by the furnace
splits up into two beams according to the spin.

¹ Diagram drawn by Theresa Knott, taken from the free encyclopedia Wikipedia.

Even though the terms cause and effect are even more philosophically
problematic when quantum effects come into play, we claim that X influences
Y: if we subject the particles to a spin measurement before the beam passes
the inhomogeneous field and remove all atoms with spin down, for instance,
we get only one branch of the beam. Given the simplified assumptions above,
the Markov kernel $p(y \mid x)$ coincides with the second order kernel in eq. (3).

Assume now we had observed a Gaussian marginal distribution for Y
instead of Gaussian conditionals and a conditional $p(x \mid y)$ as in eq. (4). Our
inference rule would then assume that Y is the cause. For this reason we want
to check whether there are modifications of the Stern-Gerlach experiment
which keep the causal direction but generate such a distribution. We could,
for instance, assume that the transversal potential in the furnace is strongly
anharmonic such that the particle momenta Z are distributed according to
some probability density q(z) after the particles have left the source. Due to
the laws of motion, we assume that y is a linear function of z for both spin
values:

$$y = a z + b_+ \quad \text{and} \quad y = a z + b_- .$$

Here $b_\pm$ denotes the shift of the expected values caused by the magnetic
moments of particles with spin $x = \pm 1/2$ and $a \in \mathbb{R}$ is some constant. In
order to obtain Gaussian marginals for Y, q(z) must be such that the convex
sum of q(z) and its shifted copy is Gaussian. To see that this is impossible
we recall that the Gaussian measure could then be written as a convolution
of q(z) with a measure $\mu$ that is supported by two points. Hence the Fourier
transform of $\mu$ multiplied with the Fourier transform of q(z) would be the
Fourier transform of a Gaussian, which is again a Gaussian (up to a phase
function). But this is in contradiction to the fact that the Fourier transform
of $\mu$ has zeros. This shows that Gaussian marginals for Y cannot be obtained
by choosing a "contrived" potential only. We would also have to modify
the laws of motion given by the magnetic field. One could, for instance,
have a field with strongly inhomogeneous field gradient such that atoms with
different transversal momenta enter locations with different field gradient.

Fig. 3 shows a simplified graphical model of the causal structure: the
position Y is here assumed to be a deterministic function f(x, z) of the
binary variable spin X and a "noise" variable Z, the initial momentum.
Smoothness of the conditional $p(y \mid x)$ is here due to the smoothness of f and
the smoothness of the distribution of momenta. Last but not least, we should
stress the decisive assumption that spin and initial momenta are statistically
independent.

Figure 3: Graphical model of the causal structure of the Stern-Gerlach experiment
(see text), where the initial momentum represents the noise (that is not explicitly
considered in the causal structure $X \to Y$).

3.2 Spin in a stationary magnetic field 

Now we present an example where a continuous classical variable Y influences
a discrete variable X. Consider a spin-1/2 particle subjected to a field in
z-direction whose (randomly fluctuating) strength is represented by the random
variable Y. The binary variable X represents here the possible outcomes
X = ±1/2 for a spin measurement in z-direction. They occur in thermal
equilibrium with the Boltzmann probabilities, i.e., we have

$$p(x \mid y) = \frac{e^{-a x y}}{e^{a y / 2} + e^{-a y / 2}}, \qquad x = \pm 1/2, \tag{5}$$

with an appropriate constant a containing Boltzmann's constant k, the
temperature and the magnetic moment. This is because the density operator of a
quantum system with Hamiltonian H and temperature T is given by

$$\rho = \frac{1}{z} \, e^{-\frac{H}{kT}}, \tag{6}$$

where z is the appropriate normalization factor. The conditional probability
for the effect X given the cause Y is then the second order model in eq. (4).

For a d/2-spin system having the d+1 possible values j = -d/2, -d/2 + 1, ..., d/2
for the spin in a given direction, second order models yield
conditionals of the form in eq. (2), i.e.,

$$p(x = j \mid y) = \frac{\exp(-a j y + b j^2 + c j)}{\sum_{l=-d/2}^{d/2} \exp(-a l y + b l^2 + c l)}$$

(with appropriate constants a, b, c) as "plausible" conditional distributions.
This parametric family contains the physically correct Boltzmann
probabilities

$$p(x = j \mid y) = \frac{\exp(-a j y)}{\sum_{l=-d/2}^{d/2} \exp(-a l y)}$$

by setting b = c = 0.
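A minimal sketch of this parametric family, with hypothetical values for the constant a and the field strength y, shows that setting b = c = 0 yields probabilities whose ratios are pure Boltzmann factors.

```python
import math

def second_order_conditional(y, d, a, b=0.0, c=0.0):
    """p(x = j | y) for j = -d/2, ..., d/2, proportional to exp(-a j y + b j^2 + c j)."""
    levels = [-d / 2 + k for k in range(d + 1)]
    weights = [math.exp(-a * j * y + b * j * j + c * j) for j in levels]
    z = sum(weights)  # partition function z(y)
    return {j: w / z for j, w in zip(levels, weights)}

a, y = 1.3, 0.8  # hypothetical constant and field strength

# Spin 1/2 (d = 1) with b = c = 0: the ratio of the two probabilities is exp(-a y).
spin_half = second_order_conditional(y, d=1, a=a)
ratio = spin_half[0.5] / spin_half[-0.5]

# Spin 1 (d = 2) with non-zero b, c is still a normalized member of the family.
spin_one = second_order_conditional(y, d=2, a=a, b=0.2, c=-0.1)
```

The function name and its signature are illustrative choices; only the functional form is taken from the text.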

As in the Stern-Gerlach experiment, we try to modify the setup (for spin
1/2) such that the same causal mechanism leads to a second order model in
the opposite causal direction. Then p(y) would be a mixture of two Gaussians
with equal variance $\sigma^2$ but different expected values $m_\pm$, i.e.,

$$p(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \Big[ p(x = 1/2) \exp\Big( -\frac{(y - m_+)^2}{2\sigma^2} \Big) + p(x = -1/2) \exp\Big( -\frac{(y - m_-)^2}{2\sigma^2} \Big) \Big].$$

To have a field strength that is a mixture of two Gaussians is a priori not
unphysical even though it probably occurs less often than having a unimodal
field strength. In order to generate the corresponding Gaussian conditionals
$p(y \mid x = \pm 1/2)$ we would have to choose the constant a in eq. (5) such that

$$\frac{p(x = 1/2 \mid y)}{p(x = -1/2 \mid y)} = \frac{p(x = 1/2)}{p(x = -1/2)} \exp\Big( \frac{-(y - m_+)^2 + (y - m_-)^2}{2\sigma^2} \Big).$$

Comparing this to the Boltzmann probabilities in eq. (5), whose ratio is
$e^{-a y}$, we conclude that the temperature has to be chosen such that the
constant a satisfies $a = (m_- - m_+)/\sigma^2$. In contrast to the modifications
in the Stern-Gerlach experiment that were required to "outsmart our principle",
the causal mechanism as such does not have to be modified here. There is
nevertheless a constraint that makes the described situation unlikely to occur
unless the setup was designed by hand: The fact that the temperature has to be
adjusted to one specific value that is derived from $m_\pm$ (even though there is
no physical reason that makes this coincidence likely) shows that the
counterexample is non-generic².

Here, the reason why thermodynamics predicts a smooth conditional 
probability for the effect given the cause is, abstractly speaking, the following. 
The equilibrium states maximize entropy subject to the energy. Here the en- 
ergy depends smoothly (just linearly) on the cause (i.e. the field strength). 
Hence the smoothness of the conditionals is due to the smoothness of the 
physical Hamiltonian. 

3.3 Thermal equilibrium with artificial adjustments 

The following setup may be a bit artificial from the physics point of view.
However, it provides a first impression of the link between the causal
direction and the order of maximizing the entropies of subsystems that is essential
for our first implementation of the plausible Markov kernel principle. Consider
two classical systems described by continuous variables X, Y and a joint
Hamiltonian of the form

$$H(x, y) = H_1(x) + H_2(y) + H_{12}(x, y).$$

We now discuss the following three hypothetical experiments. For reasons of
convenience, we will identify the systems with the variables representing the
physical states.

(1) Systems X and Y influence each other

Subject the joint system to a thermal bath with inverse temperature $\beta$. If it
is thermalized, its statistical state is given by

$$p_{\leftrightarrow}(x, y) := \frac{1}{z} \exp\big( -\beta (H_1(x) + H_2(y) + H_{12}(x, y)) \big),$$

where z is the partition sum.

² In [20] we have argued that this implies that P(cause) and P(effect | cause) share
algorithmic information, which suggests to prefer the opposite causal direction.

(2) System X influences Y

Remove the interaction term $H_{12}$, subject system 1 to the bath, and adjust the
state of system 1, i.e., fix the actual value x of the variable X. Couple both
systems by the interaction $H_{12}$ and subject the joint system (or only system
2) to the bath. Thermalization leads to

$$p_{\rightarrow}(x, y) = p_{\rightarrow}(x) \, p_{\rightarrow}(y \mid x),$$

with

$$p_{\rightarrow}(x) := \frac{1}{z_1} \exp\big( -\beta H_1(x) \big),$$

where $z_1$ is the corresponding partition integral and

$$p_{\rightarrow}(y \mid x) := \frac{1}{z_{\rightarrow}(x)} \exp\big( -\beta (H_2(y) + H_{12}(x, y)) \big),$$

with the partition function

$$z_{\rightarrow}(x) := \int \exp\big( -\beta (H_2(y) + H_{12}(x, y)) \big) \, dy. \tag{7}$$

(3) System Y influences X

Let $p_{\leftarrow}(x, y)$ be the density generated by the same scenario as in (2) with
the roles of systems 1 and 2 interchanged.

Experiment 1 describes bidirectional influence, in experiment 2 X is the
cause and Y the effect, and in experiment 3 we have the reversed case. Now we
want to discuss under which circumstances the three distributions coincide.
For simplicity, we denote by $\doteq$ equality up to an additive constant for the
logarithm of unconditional distributions. We certainly have q(x) = r(x) if
and only if $\ln q(x) \doteq \ln r(x)$. In analogy to eq. (7), we introduce the partition
function

$$z_{\leftarrow}(y) := \int e^{-\beta (H_1(x) + H_{12}(x, y))} \, dx.$$

We then have

$$\ln p_{\rightarrow}(x, y) \doteq -\beta \big( H_1(x) + H_{12}(x, y) + H_2(y) \big) - \ln z_{\rightarrow}(x) \tag{8}$$

$$\ln p_{\leftarrow}(x, y) \doteq -\beta \big( H_1(x) + H_{12}(x, y) + H_2(y) \big) - \ln z_{\leftarrow}(y) \tag{9}$$

$$\ln p_{\leftrightarrow}(x, y) \doteq -\beta \big( H_1(x) + H_{12}(x, y) + H_2(y) \big) \tag{10}$$

One checks easily that $p_{\leftrightarrow}(y \mid x) = p_{\rightarrow}(y \mid x)$ and
$p_{\leftrightarrow}(x \mid y) = p_{\leftarrow}(x \mid y)$. Hence
the difference between $p_{\leftrightarrow}$ and $p_{\rightarrow}$ is only caused by different marginal
distributions for X. While $p_{\rightarrow}(x)$ is directly determined by the free Hamiltonian
$H_1(x)$, the computation of $p_{\leftrightarrow}(x)$ involves the partition function $z_{\rightarrow}(x)$. We
obtain:

$$\ln p_{\leftrightarrow}(x, y) - \ln p_{\rightarrow}(x, y) = -\ln z + \ln z_1 + \ln z_{\rightarrow}(x) =: f(x). \tag{11}$$

We see that here the partition functions are "responsible" for the fact that 
different causal directions lead to different joint distributions because the 
logarithms of probabilities are Hamiltonians up to a complex function of the 
cause. The following theorem shows under which circumstances the different 
scenarios yield different joint distributions: 

Theorem 2 (asymmetries caused by partition function) 

The following conditions are necessary and sufficient that the joint distribu- 
tions in the above scenario coincide: 

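1. $p_{\rightarrow} = p_{\leftrightarrow}$ if and only if the partition function
$z_{\rightarrow}$ is a constant, and $p_{\leftrightarrow} = p_{\leftarrow}$ if and
only if $z_{\leftarrow}$ is constant.

2. $p_{\rightarrow} = p_{\leftarrow}$ if and only if both $z_{\rightarrow}$ and
$z_{\leftarrow}$ are constants.

3. The equalities $p_{\rightarrow}(y|x) = p_{\leftrightarrow}(y|x)$ and
$p_{\leftarrow}(x|y) = p_{\leftrightarrow}(x|y)$ always hold.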

Proof: The first part of the first statement follows by combining eqs. (8) 
with (10), the second part follows from symmetry arguments. The second 
statement follows from combining eqs. (8), (9), and (10) and the fact that a 
function depending on x can only be equal to a function of y up to a constant 
if both functions are constants. □ 
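
As a numerical sanity check of eq. (11), the following minimal sketch replaces the integrals by sums over a small grid. The grid, the value of beta, and the concrete Hamiltonians `h1`, `h2`, `h12` are illustrative assumptions and not taken from the paper. We also assume that experiment 1 thermalizes the coupled system into the Gibbs state of the full Hamiltonian, as eq. (10) suggests; `p_both` and `p_forward` stand for the distributions of experiments 1 and 2.

```python
import math

BETA = 1.0                      # inverse temperature (assumed value)
XS = [-1.0, 0.0, 1.0]           # discretized values of X (assumption)
YS = [-1.0, 0.0, 1.0]           # discretized values of Y (assumption)

def h1(x):                      # free Hamiltonian of system 1 (hypothetical)
    return x * x

def h2(y):                      # free Hamiltonian of system 2 (hypothetical)
    return 0.5 * y * y

def h12(x, y):                  # bilinear interaction (hypothetical)
    return x * y

def z_forward(x):
    """Discrete analogue of the partition function in eq. (7)."""
    return sum(math.exp(-BETA * (h2(y) + h12(x, y))) for y in YS)

Z1 = sum(math.exp(-BETA * h1(x)) for x in XS)
Z = sum(math.exp(-BETA * (h1(x) + h2(y) + h12(x, y))) for x in XS for y in YS)

def p_both(x, y):
    """Experiment 1: joint Gibbs state of the coupled system."""
    return math.exp(-BETA * (h1(x) + h2(y) + h12(x, y))) / Z

def p_forward(x, y):
    """Experiment 2: thermalize X first, then Y conditioned on X."""
    px = math.exp(-BETA * h1(x)) / Z1
    py_given_x = math.exp(-BETA * (h2(y) + h12(x, y))) / z_forward(x)
    return px * py_given_x

def f(x):
    """Right-hand side of eq. (11); depends on the cause x only."""
    return -math.log(Z) + math.log(Z1) + math.log(z_forward(x))

def log_ratio(x, y):
    """Left-hand side of eq. (11)."""
    return math.log(p_both(x, y)) - math.log(p_forward(x, y))
```

For this particular choice `z_forward` is not constant in x, so the two joint distributions differ, in line with the first statement of Theorem 2.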

The above scenarios are an example where the joint distribution is obtained
by first maximizing the entropy of the cause variable subject to the
corresponding free Hamiltonian and then maximizing the conditional entropy of
the effect, given the cause, subject to the joint Hamiltonian. In other words,
the order of maximizing the entropies coincides with the causal order.

To see that natural Hamiltonians will often lead to second order Markov
kernels, let systems 1 and 2 be systems with many degrees of freedom, i.e.,
x and y are vector-valued variables $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_m)$.
The Hamiltonians that occur are often quadratic in the relevant variables
(e.g. the canonical variables $q_i, p_i$). Then the free Hamiltonians are of the
form

$$H_1(x) = \sum_{j} a_j x_j + \sum_{ij} b_{ij} x_i x_j ,$$

and similarly for $H_2(y)$. An important class of possible interactions is given
by

$$H_{12}(x,y) = \sum_{i,j=1}^{n} \gamma_{ij} x_i x_j + \sum_{i,j=1}^{m} \eta_{ij} y_i y_j + \sum_{i=1}^{n} \sum_{j=1}^{m} \epsilon_{ij} x_i y_j ,$$

with parameters $\gamma_{ij}$, $\eta_{ij}$, $\epsilon_{ij}$. A natural example would be
d + f linearly coupled harmonic oscillators where n = 2d and m = 2f and the $x_i$
are positions and momenta for d oscillators and the $y_j$ for the remaining f
oscillators.

For anharmonic oscillators, one could also have polynomials of higher
degree. To discuss an example of a system where the Hamiltonian is not a
polynomial in the canonical variables, we recall that the potential energy of
an electron in the Coulomb field of a positive particle is proportional to 1/x
where x is the distance to the particle. The total energy is thus a sum
of a polynomial of second order (the kinetic energy) and the 1/x term.

However, second order polynomials already provide a class of systems
that occur quite often. The following theorem is a simple conclusion from the
above remarks. Its intention is to stress that the simplicity of Hamiltonians
is inherited by the causal conditionals P(effect | cause) but not necessarily by
the non-causal ones.

Theorem 3 (second order Markov kernels in equilibrium)

Let $S_1$ and $S_2$ be two classical physical systems with observables
$x := (x_1, \ldots, x_n)$ and $y := (y_1, \ldots, y_m)$ and assume that their free
Hamiltonians $H_1(x)$, $H_2(y)$ and their interaction Hamiltonian $H_{12}(x,y)$ are
polynomials of second order. Let system $S_1$ causally influence system $S_2$ in the
sense of the above scenario where the state of $S_1$ is adjusted to the observed
value before the interaction with $S_2$ is turned on. Then p(x) and p(y|x) are
second order Markov kernels.
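
To make the statement concrete, here is a worked scalar instance (n = m = 1). The choices $H_2(y) = b y^2$ with $b > 0$ and $H_{12}(x,y) = \epsilon x y$ are our own illustrative assumptions, and we read "second order Markov kernel" as "the logarithm of the conditional is a polynomial of second order", which is an assumption about the definition given earlier in the paper.

```latex
% Partition function of eq. (7) for H_2(y) = b y^2, H_{12}(x,y) = \epsilon x y:
% complete the square, b y^2 + \epsilon x y
%   = b \left( y + \frac{\epsilon x}{2b} \right)^2 - \frac{\epsilon^2 x^2}{4b}.
z_{\rightarrow}(x)
  = \int e^{-\beta (b y^2 + \epsilon x y)} \, dy
  = \sqrt{\frac{\pi}{\beta b}} \, \exp\!\left( \frac{\beta \epsilon^2 x^2}{4 b} \right)

% Logarithm of the causal conditional: a second order polynomial in (x, y).
\ln p_{\rightarrow}(y|x)
  = -\beta \left( b y^2 + \epsilon x y \right) - \ln z_{\rightarrow}(x)
  = -\beta b \left( y + \frac{\epsilon x}{2b} \right)^2
    - \frac{1}{2} \ln \frac{\pi}{\beta b}
```

In this instance $\ln z_{\rightarrow}(x)$ is itself quadratic in x, so the causal conditional is a Gaussian whose mean depends linearly on the cause.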



3.4 Stationary process with temperature gradient 

In the preceding section, the well-defined causal arrow was put in by hand.
Now we discuss a natural physical scenario where back action is negligible
and show that the order of entropy maximization also coincides with the
causal order (footnote 3). To this end, we present a model consisting of two
baths with different temperatures.

3 This subsection is related to [21], where we have considered a model with two
interacting systems with non-equilibrium states. After we assumed separation of
dynamical time-scales, a well-defined causal arrow emerged whenever the interaction
is sufficiently weak compared to the free Hamiltonian of the system that acts as a
cause. In this limiting case, the stationary state of the joint system has the
following properties. The state of the system representing the cause was given by a
microcanonical distribution of its free Hamiltonian and the state of the system
representing the effect by a microcanonical distribution of its conditional
Hamiltonian. Apart from this, it turned out that in the described limit the
"cause-system" is a well-behaved thermodynamic system whose coarse-grained entropy
is only increasing but never decreasing. Hence the conditions to have well-defined
thermodynamic properties of subsystems turned out to be related to having
well-defined causal directions. However, the setting discussed in the present paper
is more appropriate to motivate the method in Definition 6.

Following [22] we consider two classical systems 1 and 2, described by
variables X and Y, respectively, and a Hamiltonian H(x,y). System j is
subjected to temperature $T_j$. Then [22] describes the coupled Langevin equations

$$\Gamma_1 \dot{x} = -\partial_x H(x,y) + \eta_1(t) , \tag{12}$$
$$\Gamma_2 \dot{y} = -\partial_y H(x,y) + \eta_2(t) , \tag{13}$$

with stochastic forces $\eta_j$ whose product satisfies

$$E\big(\eta_i(t)\, \eta_j(t')\big) = 2 \Gamma_i T_i\, \delta_{ij}\, \delta(t - t') ,$$

where $\Gamma_j$ is the damping constant for system j and $\partial_x$ and
$\partial_y$ denote partial derivatives.

Now x is assumed to change more slowly than y, which is ensured by the
condition $\Gamma_1 \gg \Gamma_2$. Then it is argued that one may keep x fixed and
solve equation (13) for Y and obtain the x-dependent equilibrium

$$p(y|x) = \frac{1}{z(x)} \exp\big(-\beta_2 H(x,y)\big) , \tag{14}$$

with the partition function

$$z(x) := \int \exp\big(-\beta_2 H(x,y)\big)\, dy ,$$

and the inverse temperatures $\beta_j := 1/(kT_j)$. In order to calculate p(x) we
average the energy value H(x,y) according to p(y|x) in eq. (14) and obtain
from eq. (12) the Langevin equation

$$\Gamma_1 \dot{x} = -\partial_x H_{\mathrm{eff}}(x) + \eta_1(t)$$

with the effective Hamiltonian

$$H_{\mathrm{eff}}(x) := -kT_2 \ln \int \exp\big(-\beta_2 H(x,y)\big)\, dy .$$

Then we obtain

$$p(x) = \frac{1}{z} e^{-\beta_1 H_{\mathrm{eff}}(x)} \tag{15}$$

and compute the joint distribution using

$$p(x,y) := p(x)\, p(y|x) .$$

It has been shown in [22] that p(x,y) can be obtained by maximizing
$T_1 S(X) + T_2 S(Y|X)$ subject to

$$\int p(x,y)\, H(x,y)\, dx\, dy = e ,$$

for an appropriate value e. This indicates that the limit $T_1 \gg T_2$ yields a
joint distribution that is obtained by first maximizing the entropy of system
1 and then maximizing the conditional entropy of system 2. To study this
limit we write

$$H(x,y) = H_1(x) + H_2(y) + H_{12}(x,y) .$$

Obviously, we have

$$H_{\mathrm{eff}}(x) = H_1(x) - kT_2 \ln \int e^{-\beta_2 (H_{12}(x,y) + H_2(y))}\, dy . \tag{16}$$

Now we consider the regime where the interaction energy is small compared to
$kT_1$ but not small compared to $kT_2$. Then, intuitively speaking, system 1 does
not feel the interaction $H_{12}$, but influences system 2 via $H_{12}$. Formally,
we consider a sequence of temperatures $T_1^{(n)} := nT_1$ and rescale the free
Hamiltonian of system 1 by defining $H_1^{(n)}(x) := nH_1(x)$. The interaction
energy and $T_2$ will
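be kept constant. With $\beta_1^{(n)} := \beta_1/n$ and eq. (16) we obtain

$$\beta_1^{(n)} H_{\mathrm{eff}}^{(n)}(x) = \beta_1 H_1(x) - \frac{\beta_1}{n}\, kT_2 \ln \int e^{-\beta_2 (H_{12}(x,y) + H_2(y))}\, dy ,$$

which yields

$$\lim_{n \to \infty} \beta_1^{(n)} H_{\mathrm{eff}}^{(n)}(x) = \beta_1 H_1(x) .$$

Using eq. (15), the sequence of marginal distributions converges to

$$p(x) = \frac{1}{z} e^{-\beta_1 H_1(x)} .$$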

We conclude: If $kT_1$ is large compared to the interaction energy, the joint
distribution of the bipartite system is obtained by (1) maximizing the entropy
of system 1 subject to the energy corresponding to its free Hamiltonian and
then (2) maximizing the conditional entropy of system 2 subject to the total
energy. We obtain the same statement by decreasing $H_{12}$, $H_2$ and $T_2$
according to a common scaling factor.
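
The limit can be illustrated numerically. The sketch below is a toy check under our own assumptions: integrals are replaced by sums over a small integer grid, k and the temperatures are set to 1, and the Hamiltonians `h1`, `h2`, `h12` are hypothetical. It evaluates the marginal of eq. (15) with the effective Hamiltonian of eq. (16) for the rescaled sequence $T_1^{(n)} = nT_1$, $H_1^{(n)} = nH_1$ and compares it with the marginal of the free Hamiltonian alone.

```python
import math

K = 1.0                          # Boltzmann constant in arbitrary units (assumption)
T1, T2 = 1.0, 1.0                # bath temperatures (assumed values)
XS = [-2, -1, 0, 1, 2]           # discretized values of X (assumption)
YS = [-2, -1, 0, 1, 2]           # discretized values of Y (assumption)

def h1(x):                       # hypothetical free Hamiltonian of system 1
    return float(x * x)

def h2(y):                       # hypothetical free Hamiltonian of system 2
    return float(y * y)

def h12(x, y):                   # hypothetical interaction
    return float(x * y)

def h_eff(x, n):
    """Discrete analogue of eq. (16) with H_1 rescaled to n * H_1."""
    beta2 = 1.0 / (K * T2)
    z2 = sum(math.exp(-beta2 * (h12(x, y) + h2(y))) for y in YS)
    return n * h1(x) - K * T2 * math.log(z2)

def marginal(n):
    """Marginal of eq. (15) at temperature n * T1 for the rescaled system."""
    beta1_n = 1.0 / (K * n * T1)
    weights = [math.exp(-beta1_n * h_eff(x, n)) for x in XS]
    norm = sum(weights)
    return [w / norm for w in weights]

def free_marginal():
    """Distribution of system 1 determined by its free Hamiltonian alone."""
    beta1 = 1.0 / (K * T1)
    weights = [math.exp(-beta1 * h1(x)) for x in XS]
    norm = sum(weights)
    return [w / norm for w in weights]

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

if __name__ == "__main__":
    p0 = free_marginal()
    for n in (1, 10, 100, 1000):
        print(n, total_variation(marginal(n), p0))
```

In this toy setting the distance to the free marginal shrinks as n grows, because the correction term in the exponent carries a factor 1/n.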

The fact that in these limits the distribution of X is determined by the free
Hamiltonian alone is, from an intuitive perspective, already a good indicator
for the fact that the influence of system 2 on system 1 goes to zero. But
our intention is to support this way of reasoning, not to take it for granted.
In order to show that we may indeed consider the variable X as the cause
and variable Y as the effect (in the above limits), we show that system 1
is insensitive with respect to adjusting system 2 to different values as in
the preceding subsection. To quantify the influence of Y on X we derive
an upper bound on the relative entropy distance between the following two
distributions: (1) the distribution $p_0(x)$ that would be obtained for system 1
without interaction and (2) the distribution $p_{\leftarrow}(x|y)$ that system 2
induces when it is adjusted to some specific value y:

Lemma 1 (upper bound on the back action) 

Let $p_{\leftarrow}$ be defined as in Section 3.3 and y be arbitrary, but fixed. Define

$$p_0(x) = \frac{1}{z} \exp\big(-\beta_1 H_1(x)\big) .$$



This shows that the back action indeed converges to zero for $\beta_1 \to 0$.
Proof: The relative entropy distance reads 



□

It is natural to ask whether one could also construct a limit where the fast
system influences the slow one without significant back action by assuming
$T_2 \gg T_1$. However, the rescaling $T_2^{(n)} := nT_2$ and $H_2^{(n)}(y) := nH_2(y)$
leads for $n \to \infty$ to a conditional $P(y|x) = \exp(-\beta_2 H_2(y))/z$, i.e.,
X and Y become independent.

Note that there is a nice way to quantify action and back action in the
above "generalized equilibrium" by a hypothetical sender/receiver protocol.
Assume a sender having access to system 1 adjusts his system to one value
x according to the marginal p(x) in eq. (15) above. Then the receiver observes
values y with probability $p_{\rightarrow}(y|x)$. His information about X is given
by the relative entropy distance between
$P_{\rightarrow}(X,Y) = P(X) P_{\rightarrow}(Y|X)$ and $P(X) P_{\rightarrow}(Y)$. Hence



where we have used that relative entropy is convex [23]. If we define
$I_{\leftarrow}(X:Y)$ in an analogous way, it follows that the "back action"
information $I_{\leftarrow}(X:Y)$ tends to zero (in the limit $n \to \infty$). This is
because calculations similar to the proof of Lemma 1 show that
$D(P(X|y) \| P(X|y'))$ converges to zero for all y, y'. On the other hand, the
"forward information" $I_{\rightarrow}(X:Y)$ converges to a non-zero value because the
joint distribution on X, Y obtained by the forward sender/receiver scenario
coincides exactly with the natural equilibrium P defined by eqs. (14) and (15)
where we indeed have statistical dependences.

4 Different aspects of simplicity 
4.1 Random walk on integers 

The second order Markov kernels are simple with respect to the following
two criteria: (1) The conditionals $p(x_j | x_1, \ldots, x_{j-1})$ depend smoothly
on $x_j$ and (2) they depend smoothly on $x_1, \ldots, x_{j-1}$. Now we will describe
another aspect of simplicity that does not fit into these two categories.

Consider a random walk on Z (the set of integers) starting at position
0. In every step we move either one site to the left or one site to the right
with probability 1/2 each and stop after n steps. Accordingly, we define the
random variables $X_1, \ldots, X_n$ with values in Z describing the position after
step 1, ..., n. The causal structure of the walk is certainly given by the linear
directed graph

$$X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n . \tag{18}$$

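The corresponding conditionals for every variable, given its parent node, read:

$$p(x_1 = \pm 1) = \frac{1}{2} , \qquad p(x_j | x_{j-1}) = \begin{cases} 1/2 & \text{for } |x_j - x_{j-1}| = 1 \\ 0 & \text{otherwise.} \end{cases}$$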

The conditional independences entailed by the causal structure (18) are also
consistent with the reverse causal hypothesis

$$X_n \rightarrow X_{n-1} \rightarrow \cdots \rightarrow X_1 . \tag{19}$$

To see this, recall that we only need to check the Markov condition
(Definition 1). Given its parent $X_{j+1}$, every $X_j$ must be conditionally
independent of all its non-descendants (except from its parent), i.e., the
variables $X_{j+2}, X_{j+3}, \ldots, X_n$. Using the d-separation criterion in [3]
one can easily show that this follows from the Markov condition corresponding to
the true causal structure (18). Due to eq. (1) the joint probability then admits
the factorization

$$p(x_1, \ldots, x_n) = p(x_n)\, p(x_{n-1}|x_n)\, p(x_{n-2}|x_{n-1}) \cdots p(x_1|x_2) .$$

The conditionals $p(x_{j-1}|x_j)$ are, of course, also "simple" in the sense that
they vanish for every pair $(x_{j-1}, x_j)$ for which $|x_{j-1} - x_j| \neq 1$.
However, the conditionals are less simple in the sense that $p(x_{j-1}|x_j)$
depends on j. To see this, assume that we are on position $\ell$ after $\ell$ steps.
Then the position after step $\ell - 1$ was definitely $\ell - 1$. In other words,
the two cases $x_{j-1} - x_j = 1$ and $x_{j-1} - x_j = -1$ are not equally likely for
the backward time conditional (footnote 4) and the bias depends on j.
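
The dependence of the backward conditional on j can be checked exactly. The following sketch is our own illustration (function names are hypothetical): it propagates the position distribution of the walk with rational arithmetic and applies Bayes' rule to obtain $p(x_{j-1} | x_j)$.

```python
from fractions import Fraction

def position_distributions(n):
    """Return [p_1, ..., p_n]; p_j maps a position to P(X_j = position)."""
    dist = {0: Fraction(1)}          # the walk starts at position 0
    out = []
    for _ in range(n):
        new = {}
        for pos, pr in dist.items():
            for step in (-1, 1):     # left or right with probability 1/2 each
                new[pos + step] = new.get(pos + step, Fraction(0)) + pr / 2
        dist = new
        out.append(dist)
    return out

def backward_conditional(dists, j, x_prev, x_j):
    """P(X_{j-1} = x_prev | X_j = x_j) via Bayes' rule, for j >= 2.

    x_j must be reachable after j steps, otherwise the lookup fails.
    """
    if abs(x_prev - x_j) != 1:
        return Fraction(0)
    p_prev = dists[j - 2].get(x_prev, Fraction(0))   # P(X_{j-1} = x_prev)
    return p_prev * Fraction(1, 2) / dists[j - 1][x_j]

if __name__ == "__main__":
    d = position_distributions(6)
    for j in (2, 4, 6):
        # Probability that position 2 at step j was reached from position 1.
        print(j, backward_conditional(d, j, 1, 2))
```

The forward conditional is 1/2 for every j, whereas the backward conditional for the same pair of positions changes with j.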

The random walk represents another aspect of simplicity that is not taken
into account in any of our inference rules proposed so far. It is the simplicity
of the dependence on the nodes in the sense that the function
$j \mapsto p(x_j | x_{j-1})$ is simple since it is even constant in j. The "physical"
reason is that the mechanism that determines the transition probabilities is
constant.

Due to the thermodynamic spirit of this paper it is worth mentioning
that the discussed time asymmetry is "fading away" after many steps. This
is because

$$p(x_j = \ell, x_{j-1} = m) = \begin{cases} p(x_{j-1} = m)/2 & \text{for } |\ell - m| = 1 \\ 0 & \text{otherwise.} \end{cases} \tag{20}$$

Since we have $p(x_{j-1} = m) \approx p(x_{j-1} = m \pm 1)$ for large j, the expression
on the left hand side of Eq. (20) becomes asymptotically symmetric with
respect to exchanging $\ell$ and m. To obtain a strictly time-symmetric analogue,
consider a random walk on a cycle consisting of N sites. If the initial position
is completely unknown, i.e., $p(x_1) = 1/N$ for all $x_1 \in \{0, 1, \ldots, N-1\}$, the
process is perfectly symmetric with respect to time inversion.
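
The time symmetry on the cycle can be verified in the same way. This is a small sketch under our own naming; N = 5 in the usage is an arbitrary choice. With a uniform initial distribution the backward conditional obtained from Bayes' rule equals the forward transition probability, whereas a sharply localized start breaks the symmetry.

```python
from fractions import Fraction

def forward(m, l, n_sites):
    """P(X_j = l | X_{j-1} = m): 1/2 for each of the two neighbouring sites."""
    total = Fraction(0)
    for step in (-1, 1):
        if (m + step) % n_sites == l:
            total += Fraction(1, 2)
    return total

def step(dist, n_sites):
    """Distribution of the position after one more step on the cycle."""
    new = [Fraction(0)] * n_sites
    for m, pr in enumerate(dist):
        for l in range(n_sites):
            new[l] += pr * forward(m, l, n_sites)
    return new

def backward(dist_prev, m, l, n_sites):
    """P(X_{j-1} = m | X_j = l) via Bayes' rule; requires P(X_j = l) > 0."""
    dist_next = step(dist_prev, n_sites)
    return dist_prev[m] * forward(m, l, n_sites) / dist_next[l]

if __name__ == "__main__":
    n_sites = 5
    uniform = [Fraction(1, n_sites)] * n_sites
    print(backward(uniform, 0, 1, n_sites), forward(1, 0, n_sites))
```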

4 From the psychological point of view, it is remarkable that one is tempted to think that
the backward-time conditional would be the same for this example as the forward-time
conditional. This is consistent with a remark in the introduction: Our intuition seems
to evaluate the simplicity of a model according to the simplicity of causal conditionals
because we do not even recognize when a model is complex in the converse direction.

4.2 Computational complexity in logical circuits

Here we want to describe an asymmetry with respect to another notion of
complexity, namely the complexity classes of computer science.

First we consider a boolean function f with n + m bits input and k
bits output. Let $X = (X_1, \ldots, X_n)$ be the vector of binary variables that
describe the first n input bits and $Z = (Z_1, \ldots, Z_m)$ be the vector for the
last m input bits. Let furthermore $Y := (Y_1, \ldots, Y_k)$ describe the output.
Now we interpret X as the cause, Y as the effect and Z as a noise variable
that makes the causal mechanism probabilistic. We will assume that the total
input (including "cause" and "noise") is obtained by statistically independent
initialization of the bits.

The following statement is almost obvious: 

Observation 1 (approximating P(effect | cause) is efficient)

Given a string d including

1. a description of a boolean circuit in terms of elementary gates like AND,
NAND, OR, NOR, NOT that computes f and

2. a description of a product probability distribution for Z.

Given some $\epsilon$ with $\epsilon = 1/\mathrm{poly}(|d|)$ and some constant
$c \in (0,1)$. The problem to decide for a given pair x, y whether

$$p(y|x) > c + \epsilon \quad \text{or} \quad p(y|x) < c - \epsilon$$

is in BPP ("Bounded-error, Probabilistic, Polynomial time" [24]). In other
words, there is a probabilistic algorithm whose running time increases only
polynomially in |d| solving the above decision problem such that the error
probability is smaller than some previously specified constant $\delta > 0$.

The "algorithm" for this decision problem is already given by setting the 
input to x, randomizing the noise variable Z according to the given distri- 
bution, simulating the boolean circuit and counting the number of runs with 
output y. 
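
The sampling procedure just described can be sketched as follows. This is an illustration under our own assumptions, not the paper's implementation: the circuit is passed as an ordinary Python function, `toy_circuit` is a hypothetical example, and the number of runs is taken from Hoeffding's inequality, which is one standard way to obtain a sample size polynomial in $1/\epsilon$.

```python
import math
import random

def hoeffding_runs(eps, delta):
    """Number of runs so that |estimate - p| < eps with probability >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

def estimate_conditional(circuit, x, y, noise_probs, runs, rng):
    """Monte Carlo estimate of p(y|x): fix x, sample Z, run the circuit, count hits."""
    hits = 0
    for _ in range(runs):
        # Product distribution for Z: bit i equals 1 with probability noise_probs[i].
        z = tuple(1 if rng.random() < p else 0 for p in noise_probs)
        if circuit(x, z) == y:
            hits += 1
    return hits / runs

def decide(circuit, x, y, noise_probs, c, eps, delta=0.01, seed=0):
    """Return True for 'p(y|x) > c + eps' and False for 'p(y|x) < c - eps'.

    The answer is only meaningful if one of the two cases is promised to hold.
    """
    rng = random.Random(seed)
    runs = hoeffding_runs(eps, delta)
    return estimate_conditional(circuit, x, y, noise_probs, runs, rng) > c

def toy_circuit(x, z):
    """Hypothetical circuit with n = 2, m = 1, k = 1: y = (x0 AND x1) XOR z0."""
    return ((x[0] & x[1]) ^ z[0],)

if __name__ == "__main__":
    # With P(z0 = 1) = 0.1 the exact value is p(y = 1 | x = (1, 1)) = 0.9.
    print(decide(toy_circuit, (1, 1), (1,), (0.1,), c=0.5, eps=0.05))
```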

It should be noted, however, that an exact computation of P(y|x) is not
possible in any efficient way provided that the complexity classes #P ("sharp
P") and BPP do not coincide. To see this, we set n = 0 and k = 1, i.e., the
binary variable Y is only a function of the noise Z. Let the values of the
noise variable be uniformly distributed, i.e., $p(z) = 1/2^m$ for all
$z \in \{0,1\}^m$. Then p(y|x) = p(y) is, up to the constant $1/2^m$, simply the
number of inputs z for which f(z) = 1. The problem to count the number of
satisfying inputs for a boolean function (given in so-called conjunctive normal
form)

$$f : \{0,1\}^m \rightarrow \{0,1\}$$

is complete for the complexity class #P. This class is believed to contain
extremely hard computational problems [24]. However, the hardness of giving
exact solutions is probably of minor relevance and we will now consider
approximative solutions.

We will see that P(cause | effect) is even hard to compute approximately:

Theorem 4 (approximating P(cause | effect) is NP-hard)

Let the assumptions and definitions be as in Observation 1 with general
$n, m, k \in \mathbb{N}_0$ and p(x) be some product distribution. Then the problem
to decide whether p(x|y) > 2/3 or $p(x|y) \le 1/2$ for a given pair
$(x,y) \in \{0,1\}^n \times \{0,1\}^k$ is NP-hard.

Proof: NP-hardness can even be proved for the special instance m = 0,
i.e., without introducing a noise variable Z. This shows that the general
problem contains NP. Let $g : \{0,1\}^n \rightarrow \{0,1\}$ be a boolean function.
To decide whether there is a binary string x with g(x) = 1 is known to be
NP-complete [24]. We choose g such that g(0, ..., 0) = 0. It is clear that
the restriction to this class of functions g remains NP-complete. Then we
define a function f by $f(x) = g(x) \vee h(x)$ with h(x) = 1 for x = (0, ..., 0)
and h(x) = 0 otherwise. Let now the distribution of x be uniform and
consider p(x = (0, ..., 0) | y = 1). If g has no satisfying input x we have
p(x = (0, ..., 0) | y = 1) = 1 since x = (0, ..., 0) is the only satisfying input
for f. If g has a satisfying input, f has at least two satisfying inputs and
$p(x = (0, \ldots, 0) | y = 1) \le 1/2$. □
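
The reduction in the proof can be traced on small instances. The sketch below is only an illustration with hypothetical function names: it builds f from g as in the proof and evaluates p(x = (0, ..., 0) | y = 1) by brute-force enumeration, which of course takes exponential time and is not meant as an efficient algorithm.

```python
from fractions import Fraction
from itertools import product

def make_f(g):
    """f(x) = g(x) OR h(x), where h(x) = 1 only for the all-zero input."""
    def f(x):
        h = 0 if any(x) else 1
        return g(x) | h
    return f

def posterior_of_zero(f, n):
    """p(x = (0,...,0) | y = 1) for uniformly distributed x in {0,1}^n."""
    satisfying = [x for x in product((0, 1), repeat=n) if f(x) == 1]
    zero = (0,) * n
    return Fraction(1 if zero in satisfying else 0, len(satisfying))

# Example functions g with g(0,...,0) = 0 (all hypothetical).
def g_unsatisfiable(x):
    return 0

def g_two_solutions(x):
    return x[0] & x[1]

def g_one_solution(x):
    return 1 if all(x) else 0

if __name__ == "__main__":
    for g in (g_unsatisfiable, g_two_solutions, g_one_solution):
        print(g.__name__, posterior_of_zero(make_f(g), 3))
```

The posterior equals 1 exactly when g is unsatisfiable and is at most 1/2 otherwise, so any procedure separating the two cases would decide satisfiability of g.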

To better understand the reason for the asymmetry between the complexity
of computing P(effect | cause) and P(cause | effect) we extend the boolean
function f to a bijective function F.

The Toffoli gate [25] provides a useful method to simulate conventional
boolean circuits by reversible ones. TOFFOLI is a gate with three inputs
a, b, c and three outputs a', b', c' such that a' = a, b' = b and
$c' = c \oplus (a \wedge b)$ where $\oplus$ denotes the exclusive or ("XOR"). In
words, the third bit c is inverted if and only if a and b are true. TOFFOLI can
simulate NAND by setting the third input to c = 1. Then we have
$c' = \overline{a \wedge b}$ and the outputs a', b' can be ignored (note that the
existence of "useless" output ("data garbage") and the need for adjusting
certain input bits to fixed values is characteristic for reversible computation).

Since NAND is universal and gates like AND, OR, NOR, NOT can be
simulated using a small number of NAND gates, we can simulate every given
boolean circuit with TOFFOLI gates efficiently. Furthermore, a corresponding
reversible circuit can be found efficiently by substituting every single gate
with some TOFFOLI gates.
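
A few lines of code make the two properties used here explicit: the gate simulates NAND when its third input is fixed to 1, and it is its own inverse. The function names are our own.

```python
def toffoli(a, b, c):
    """Toffoli gate: a and b pass through, c is flipped iff a AND b is true."""
    return a, b, c ^ (a & b)

def nand_via_toffoli(a, b):
    """NAND obtained by fixing the third input to 1; a', b' are garbage outputs."""
    _, _, out = toffoli(a, b, 1)
    return out

def not_via_nand(a):
    """NOT as an example of a gate built from NAND."""
    return nand_via_toffoli(a, a)

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, nand_via_toffoli(a, b))
```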

This yields an algorithm to extend f (having n + m input bits and k
outputs) to a bijective boolean function F with $\tilde{n} = n + m + r$ inputs and
$k + \ell = \tilde{n}$ outputs (described by Y, W) such that for some additional r-bit
string v the restriction of F(x, z, v) to the first k output bits coincides with
f(x, z). We may without loss of generality consider the ancilla variables V as
additional noise variables since we can specify the corresponding distributions
such that the "noise" variable always attains the same value. Hence we obtain
a boolean function F with n + m input bits and $k + \ell = m + n$ output bits such
that the conditional probabilities for the output y given the input x coincide
with the probabilities p(y|x) generated by the function f. We have then
simulated the causal effect from X to Y by a completely reversible process
using a noise variable Z and restricting the output Y, W to Y (see Fig. 4).

It is important to note that the inverse function $F^{-1}$ can be computed
efficiently: every TOFFOLI gate is its own inverse. We can therefore simulate
the circuit in backward direction. In such a setup both p(y|x) and p(x|y)
are efficiently computable provided that Y is the complete output. This is
because we can compute the complete input $(x', z) = F^{-1}(y)$ from y. Then
we know that p(x|y) = 1 for x = x' and p(x|y) = 0 otherwise.

It should be emphasized that local reversibility of the network is essential,
i.e., it is not sufficient that the computed function F is bijective. In order to
compute $F^{-1}$ efficiently we have inverted each single gate (footnote 5).

5 An example for a bijective function whose inverse is believed to be not efficiently
computable (because it is not locally invertible) is $f(n) = a^n \bmod b$ where a and b
are chosen appropriately and $n \in \{0, \ldots, b-1\}$. The security of the
crypto-system RSA relies on the assumption that the inverse of this function is hard
to find.

Figure 4: Reversible network as a model for a cause-effect relation. Left: The
"cause variable X" and the noise variable Z determine jointly the effect variable
Y and vice versa. Both conditionals p(y|x) and p(x|y) are efficiently computable.
Right: The effect variable Y does not completely determine the cause variable X
and the noise Z. The conditional p(x|y) is in general not efficiently computable,
but p(y|x) is.

We conclude: If y is the complete output of a locally reversible circuit we
can compute p(x|y) efficiently. In other words, using the complete effect of
the cause, we would be able to compute P(cause | effect) efficiently, no matter
whether we have probabilistic causality where y is additionally influenced by
a latent variable.

It has been argued [26, 27] that logically irreversible functions lead to en- 
ergy dissipation and thus thermodynamically reversible computation is only 
possible by computing only reversible functions [28, 25]. For this reason, the 
"garbage" bits w directly correspond to heat generation. The following theo- 
rem provides an upper bound on the complexity of computing the backwards 
conditional in terms of the number of garbage bits. 

Theorem 5 (complexity of P(cause | effect) and thermodynamics)

The decision whether p(x|y) > 2/3 or p(x|y) < 1/3 requires at most the
following steps: (1) the estimation of p(x,y), (2) $2^\ell$ queries of $F^{-1}$ when
$\ell$ is the number of garbage bits, and (3) estimating the probability of $2^\ell$
input strings.

Proof: Using

$$p(x|y) = \frac{p(y|x)\, p(x)}{p(y)} = \frac{p(y|x)\, p(x)}{\sum_w p(F^{-1}(y,w))} ,$$

the statement is obvious since every term $p(F^{-1}(y,w))$ can be computed
using one query of $F^{-1}$ followed by the estimation of $p(F^{-1}(y,w))$. □
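
The counting in the proof can be followed on a toy instance. The sketch below rests on our own assumptions: a hypothetical reversible map F(x, z) = (x XOR z, z) with one output bit y and one garbage bit w, and a product input distribution given by the tables `PX` and `PZ`. For simplicity the numerator p(x, y) is read off from the same enumeration over the $2^\ell$ garbage strings instead of being estimated separately.

```python
from itertools import product

PX = {0: 0.7, 1: 0.3}            # hypothetical distribution of the cause bit
PZ = {0: 0.9, 1: 0.1}            # hypothetical distribution of the noise bit

def f_forward(x, z):
    """Toy reversible circuit: y = x XOR z, garbage w = z."""
    return (x[0] ^ z[0],), (z[0],)

def f_inverse(y, w):
    """Inverse of the toy circuit: recover (x, z) from (y, w)."""
    return (y[0] ^ w[0],), (w[0],)

def p_input(x, z):
    """Probability of the complete input under the product distribution."""
    return PX[x[0]] * PZ[z[0]]

def posterior(f_inv, x, y, n_garbage, input_prob):
    """p(x|y) from 2**n_garbage queries of the inverse circuit."""
    numerator = 0.0
    denominator = 0.0
    for w in product((0, 1), repeat=n_garbage):
        x_in, z_in = f_inv(y, w)         # one query of the inverse
        pr = input_prob(x_in, z_in)      # probability of that input string
        denominator += pr                # accumulates p(y)
        if x_in == x:
            numerator += pr              # accumulates p(x, y)
    return numerator / denominator

if __name__ == "__main__":
    print(posterior(f_inverse, (1,), (1,), 1, p_input))
```

The loop length grows as $2^\ell$, which is the sense in which the number of garbage bits bounds the cost of the backward conditional.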

Using the reversible embedding, it becomes obvious that the asymmetry
between cause and effect has been put in by assumption: We have postulated
that all input bits are statistically independent. We could think of the
time-inverted scenario where x, z is distributed according to some probability
distribution P having the property that the distribution of y, w is a product
measure. Then we could efficiently compute p(x|y) applying the method in
Observation 1 and obtain an efficient simulation of the time-reversed circuit.
Obviously, such a scenario is unlikely unless we have calculated how to
randomize the input such that a product distribution of the output is obtained.

This asymmetry acquires a more physical interpretation if we think of the
bits as states of physical systems that have never been interacting before some
time $t_0$. After they interact, a collective dynamics (represented by the circuit)
creates stochastic dependences between initially independent systems.

5 Common root of the asymmetries



This section provides a unified view on the origin of the following facts: 

(1) the asymmetry of the computational complexity for the circuit in the 
preceding subsection 

(2) the asymmetry of the causal Markov condition under inverting arrows 

(3) the asymmetry of models with second order Markov kernels 

The common root is the tendency of our environment to subject a sys- 
tem to interactions with an abundance of other physical systems that are 
initially uncorrelated with the former rather than being finally uncorrelated 
(this has already been described for the logical circuit). It is clear that this
tendency is linked to other asymmetries between past and future: We see a 
scene happening in front of our eyes shortly after it has happened because 
the photons absorbed by the eyes have obtained correlations with the objects 
at which they were reflected. The physical state of the light beam was un- 
correlated with the object before it interacted with the latter but correlated 
afterwards. This is consistent with Reichenbach's principle, saying that sta- 
tistical dependences have to be explained by interactions in the past but not 
by interactions going to happen in the future. Hence, some evident asym- 
metries between past and future are related to the principle of the common 
cause [29]. 

To discuss these links we will use classical micro-physical toy models: 

(1) Different random variables represent the state of a different physical sys- 
tem or the state of the same system at a different time. This means that the 
value set of the variable is identified with the space of pure states and the 
set of measures on the value set is the space of mixed states.

(2) The space of pure states of a composed system is given by the Cartesian 
product of the spaces of the constituents. 

(3) A physical process of a closed physical system is a bijective map on its 
set of pure states. 

(4) A physical process of an open physical system is a bijective map on the 
Cartesian product of the set of pure states of the system under consideration 
and the set of pure states of an additional system, called the environment. 

Our classical microphysical models are discrete, i.e., one may interpret 
them as quantum systems whose density operators are restricted to those 
being diagonal with respect to some fixed basis. 



5.1 Microphysical model for common causes and common effects

The statistical asymmetry between a causal fork and a causal collider (see 
Fig. 1) is only the simplest case for the asymmetry of the causal Markov 
condition with respect to reversing arrows, but the crucial idea can already 
be seen from this case. 

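If Z is the common cause of X and Y we recall

$$X \perp Y \,|\, Z \quad \text{but} \quad X \not\perp Y , \tag{21}$$

where $\perp$ denotes independence and $\cdot \perp \cdot \,|\, \cdot$ is conditional
independence. If Z is the common effect of two (causally) independent causes we
have in the generic case

$$X \not\perp Y \,|\, Z \quad \text{but} \quad X \perp Y . \tag{22}$$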

Already Reichenbach [1] discussed this asymmetry in the context of mixing 
processes in interacting dynamical systems. The following subsection is not 
far from Reichenbach's idea. However, to describe the common thermody- 
namic root of the asymmetry between (21) and (22) on the one hand and 
the asymmetry postulated by the principle of plausible Markov kernels on 
the other hand (Subsection 5.2) we have chosen a class of models that is 
appropriate to discuss both types of asymmetries. 

Causal fork (common cause): Let P(X, Y, Z) be a joint distribution generated
by a causal structure where Z is the common cause of X and Y. We
construct a bijective process acting simultaneously on 6 systems

$$S_Z \times S_X \times S_{NX} \times S'_Z \times S_Y \times S_{NY} .$$

Their role is as follows. The initial state of $S_Z$ represents the variable Z and
the final states of $S_X$ and $S_Y$ (after some bijective process has acted jointly on
the 6 systems) represent the variables X and Y, respectively. The time order
guarantees that Z can only be a cause and not an effect of X and Y. Systems
$S_{NX}$ and $S_{NY}$ represent background noise that prevents X and Y from being
deterministic functions of Z (see remarks after Definition 1). The role of
$S'_Z$ is a bit more subtle and is easier to explain after the process has been
described. Let P be a product distribution on
$S_Z \times S_X \times S_Y \times S_{NX} \times S_{NY}$ and let $S'_Z$ be in an
arbitrary pure state. Then we construct a process consisting of two steps.

(Step 1) Apply a bijective map $F_Z$ on

$$S_Z \times S'_Z .$$

(Step 2) Apply bijective maps $F_X$ and $F_Y$ on

$$S_Z \times S_X \times S_{NX} \quad \text{and} \quad S'_Z \times S_Y \times S_{NY} ,$$

respectively. Note that $F_Z$ distributes the information contained in Z such
that it (or at least part of it) is afterwards available on $S_Z$ and $S'_Z$. This
"broadcasting" of information into two components ensures that $S_Z$ can have
an effect on both $S_X$ and $S_Y$ even though a direct interaction between $S_X$
and $S_Y$ is avoided. If $F_X$ and $F_Y$ both acted (one after another) on $S_Z$,
we could not exclude information transfer between them, in contradiction to
our causal model being a fork.

It is easy to show that processes of the above kind generate a distribution
with $X \perp Y \,|\, Z$. To show that every joint distribution on X, Y, Z satisfying
p(x,y,z) = p(z)p(x|z)p(y|z) can be generated by a process of this type, we
assume that $S'_Z$ starts in a state with zero entropy and $F_Z$ copies the value
of z so that it is afterwards available on both systems $S_Z$ and $S'_Z$. Using
appropriate "noise systems" $S_{NX}$ and $S_{NY}$, the maps $F_X$ and $F_Y$ can
certainly generate any desired transition matrices p(x|z) and p(y|z), respectively.

Collider (common effect): Here we consider the same 6 systems and the same
bijections, but in time-reversed order. Let P be a product distribution on
$S_X \times S_Y \times S_Z \times S_{NX} \times S_{NY} \times S'_Z$. Then implement
$F_X$ and $F_Y$ as above and $F_Z$ afterwards. Note that X influences the final
state of $S_Z$ via first influencing its intermediate state (between steps 1 and 2)
and Y influences the final state of $S_Z$ via first influencing $S'_Z$. Then we have
$X \perp Y$ by assumption, but not necessarily $X \perp Y \,|\, Z$.
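
Both constructions can be played through with single bits. The following sketch is an illustration under our own assumptions: every system is one bit, all maps are XOR-type bijections, `zc` stands for $S'_Z$, and the noise levels are arbitrary. For readability the collider uses its own XOR maps on the same subsystems rather than literally reusing the fork maps, so it only mirrors the structure described above.

```python
from itertools import product

PURE0 = {0: 1.0, 1: 0.0}         # system prepared in the pure state 0

def bern(p):
    """Distribution of a bit that equals 1 with probability p."""
    return {0: 1.0 - p, 1: p}

def joint(process, priors):
    """Exact joint distribution of (X, Y, Z) from a product prior over named bits."""
    names = list(priors)
    out = {}
    for bits in product((0, 1), repeat=len(names)):
        weight = 1.0
        for name, b in zip(names, bits):
            weight *= priors[name][b]
        if weight == 0.0:
            continue
        xyz = process(dict(zip(names, bits)))
        out[xyz] = out.get(xyz, 0.0) + weight
    return out

def fork(s):
    z = s["z"]                        # Z is the initial state of S_Z
    s["zc"] ^= s["z"]                 # step 1, F_Z: copy z onto S'_Z
    s["x"] ^= s["z"] ^ s["nx"]        # step 2, F_X on S_Z x S_X x S_NX
    s["y"] ^= s["zc"] ^ s["ny"]       # step 2, F_Y on S'_Z x S_Y x S_NY
    return s["x"], s["y"], z          # X, Y are final states

def collider(s):
    x, y = s["x"], s["y"]             # X, Y are initial states
    s["z"] ^= s["x"] ^ s["nx"]        # F_X: X (plus noise) acts on S_Z
    s["zc"] ^= s["y"] ^ s["ny"]       # F_Y: Y (plus noise) acts on S'_Z
    s["z"] ^= s["zc"]                 # F_Z: merge S'_Z into S_Z
    return x, y, s["z"]               # Z is the final state of S_Z

def dependence(p, given_z=None):
    """Max |p(x,y) - p(x)p(y)|, optionally conditioned on Z = given_z."""
    cells = {(x, y): 0.0 for x in (0, 1) for y in (0, 1)}
    for (x, y, z), w in p.items():
        if given_z is None or z == given_z:
            cells[(x, y)] += w
    total = sum(cells.values())
    pxy = {k: v / total for k, v in cells.items()}
    px = {x: pxy[(x, 0)] + pxy[(x, 1)] for x in (0, 1)}
    py = {y: pxy[(0, y)] + pxy[(1, y)] for y in (0, 1)}
    return max(abs(pxy[(x, y)] - px[x] * py[y]) for x in (0, 1) for y in (0, 1))

FORK_PRIOR = {"z": bern(0.5), "zc": PURE0, "x": PURE0, "y": PURE0,
              "nx": bern(0.1), "ny": bern(0.2)}
COLLIDER_PRIOR = {"x": bern(0.5), "y": bern(0.5), "z": PURE0, "zc": PURE0,
                  "nx": bern(0.1), "ny": bern(0.2)}
```

In the fork the dependence between X and Y disappears after conditioning on Z, while in the collider it only appears after conditioning on Z, which reproduces (21) and (22) for this toy choice.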

The backward time version of scenario 1 would be the following. The
joint system is initially correlated in a way that ensures that the application
of $F_X$, $F_Y$ and $F_Z$ makes them statistically independent. This would be a
rather contrived situation. We do not claim that every physical system is
"initially" uncorrelated from its environment. The essential point making the
backward scenario unlikely is that the correlations are exactly such that the
dynamics resolves them into a product state. Since the dynamics is bijective
this is unlikely to be the case unless initial state and the process are adjusted
[30, 31] to each other (footnote 6).

To be consistent with our model class, we rephrase this asymmetry as 
follows. 

Postulate 1 (Arrow of time in a closed system)

Let $\mathcal{X}_j$ denote the state space of system j. Let the dynamics after some
time t be given by a bijective map

$$F : \mathcal{X} \rightarrow \mathcal{X}$$

with

$$\mathcal{X} := \times_{j=1}^{k} \mathcal{X}_j .$$

Let P be a probability distribution on $\mathcal{X}$ that formalizes the initial
statistical state of the system and $P \circ F$ denote its final state.

If k is large, it is unlikely that $P \circ F$ is a product state but P is not (unless
F has been designed "by hand" in order to transform the non-product state
into a product state). The reverse scenario, that P is a product state and
$P \circ F$ is not, happens quite often.

A typical permutation of the $k$-tuples in the $k$-fold Cartesian product
creates dependences if the initial distribution is a product measure whose
entropy is not maximal. This can be considered as a model for increasing
correlations being the typical situation in closed systems. It is directly
connected with the usual arrow of time in statistical physics, where
interactions between particles typically lead to an increase of
coarse-grained entropy (cp. e.g. [32, 29, 33]).
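The claim that a typical permutation creates dependences can be illustrated numerically. The following is only a minimal sketch under our own assumptions: two subsystems with d = 8 states each, a geometric (non-uniform) marginal, and a uniformly random bijection on the joint state space; the function names are hypothetical and not taken from the text. It compares the mutual information before and after the permutation.

```python
import math
import random


def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution given as a d x d list."""
    d = len(joint)
    pa = [sum(joint[a][b] for b in range(d)) for a in range(d)]
    pb = [sum(joint[a][b] for a in range(d)) for b in range(d)]
    mi = 0.0
    for a in range(d):
        for b in range(d):
            p = joint[a][b]
            if p > 0.0:
                mi += p * math.log(p / (pa[a] * pb[b]))
    return mi


def random_bijection(joint, rng):
    """Move the probability weights along a uniformly random bijection F
    of the d*d joint states (a stand-in for a 'typical' F)."""
    d = len(joint)
    states = [(a, b) for a in range(d) for b in range(d)]
    images = states[:]
    rng.shuffle(images)
    out = [[0.0] * d for _ in range(d)]
    for (a, b), (a2, b2) in zip(states, images):
        out[a2][b2] = joint[a][b]
    return out


def before_and_after(marginal, seed=0):
    """Mutual information of the product state and of its permuted image."""
    rng = random.Random(seed)
    product = [[pa * pb for pb in marginal] for pa in marginal]
    return mutual_information(product), mutual_information(random_bijection(product, rng))


if __name__ == "__main__":
    weights = [2.0 ** (-i) for i in range(8)]
    skewed = [w / sum(weights) for w in weights]   # entropy not maximal
    uniform = [1.0 / 8] * 8                        # maximal entropy
    print("skewed :", before_and_after(skewed))
    print("uniform:", before_and_after(uniform))
```

For the uniform marginal every permutation leaves the product state invariant, which is why the postulate requires a product measure whose entropy is not maximal.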

In open systems, however, we have to take into account the following
effect: the restriction of a probability distribution on a Cartesian product
to a small fraction of subsystems is typically close to a maximal entropy
distribution (hence a product measure) even though the distribution itself
may be far away from a product distribution. In quantum systems, we even
have the stronger statement that a typical pure many-particle state is so
strongly entangled that its restriction to a small fraction

6 This would mean that initial state and dynamics have algorithmic information in
common. According to the algorithmic Markov condition postulated in [20], this requires
a causal connection between these two "objects".

Figure 5: Relaxation process with $n + 1$ weakly interacting two-level systems
($S_Y$ consists of $n$ systems). A random process distributes the initial
energy over the joint system. The initial energy of $S_X$ probabilistically
influences the final energy of $S_Y$. This provides a model for a binary
variable influencing an almost continuous one. On the other hand, the initial
energy of $S_Y$ influences the final energy of $S_X$. Then an almost
continuous variable affects a binary variable.

of subsystems is almost the maximum entropy state [34]. Therefore, it was
important for the justification of our way of reasoning that we considered
maps on closed systems by taking the environment explicitly into account
(in the form of $S_{N_X}$ and $S_{N_Y}$). Otherwise we could not justify the
remark that an increase of dependences is more typical than their resolution.

5.2 Asymmetry in the shape of conditionals 

Now we describe a scenario where a mixing process of a simple physical
system reproduces our second order Markov kernels under appropriate
conditions. Let system $S_Y$ consist of a large number $n$ of two-level
systems with energy gap $E_Y = 1$ each, and let system $S_X$ be a classical
two-level system with energy gap $E_X = m E_Y = m$, as shown in Fig. 5.

We assume that $m$ grows asymptotically proportional to $\sqrt{n}$, i.e.,
$m_n = c_n \sqrt{n}$ with $c_n \to c$. Moreover, the initial joint distribution
of the $n + 1$ two-level systems is a product distribution where the upper
level of $S_X$ is occupied with probability $r$ and the upper level of each
system in $S_Y$ with probability $q$. Then we assume that a weak interaction
drives a mixing process on the joint state space $\{0, 1\}^{n+1}$ that
randomly permutes levels with the same total energy.

We define a binary variable $X$ describing the state of $S_X$ and a variable
$Y$ that is asymptotically continuous for $n \to \infty$. Its values are
given by

$$y := \frac{\ell - nq}{\sqrt{n}} ,$$

where $\ell = 0, \dots, n$ denotes the total energy of $S_Y$.

Let $X_i, Y_i$ and $X_f, Y_f$ refer to the initial and final states,
respectively. Certainly, $X_i$ and $Y_i$ both influence $X_f$ and $Y_f$.
However, we will focus on two variables only and, for instance, say that
$X_i$ influences $Y_f$. Then $Y_i$ is considered as noise adding further
indeterminism to the causal influence. Hence, the mixing process can be
considered as a model for the causal structures $X_i \to Y_f$ and
$Y_i \to X_f$ at the same time.

We will discuss the process for different choices of $q$ and $r$ in the limit
$n \to \infty$ and show that only the following three cases occur:

(1) the joint distribution between initial and final variable does not have 
a second order model in any direction, neither the temporal nor the time 
reversed one. 

(2) it has a second order model in both directions. 

(3) a second order model exists only in the temporal direction. 

It will become obvious that the only time asymmetry in the scenario below
is that we assume statistically independent two-level systems as initial
condition instead of imposing independence as a final condition.

We introduce the following notation. The joint density $p(y_f, x_i)$ is said
to be in $S_{X_i \to Y_f}$ if $p(y_f|x_i)$ and $p(x_i)$ are asymptotically
second order Markov kernels in the sense of Definition 5 or can be
approximated by this type of conditionals.

By the usual central limit theorem, $p(y_i)$ is asymptotically Gaussian with
mean zero and variance $q(1 - q)$. In the discussion below, we will always
refer to the asymptotic case unless stated otherwise. To compute the final
distributions we have to distinguish between different regimes of $q$
(which correspond to different initial temperatures of $S_Y$).

Finite temperature:

Let $q \neq 1/2$. We discuss only $q < 1/2$ since $q > 1/2$ ("temperature
inversion") is analogous, with the roles of upper and lower levels exchanged.
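The Gaussian limit of $p(y_i)$ used throughout this discussion can be checked numerically. The sketch below is an illustration under our own assumptions (function names and the values of n and q are chosen here, not taken from the text): it compares the rescaled binomial probabilities of $\ell$ with the Gaussian density of variance $q(1-q)$.

```python
import math


def rescaled_binomial_density(n, q, y):
    """sqrt(n) * P(ell) for ell ~ Binomial(n, q), evaluated at the integer
    ell closest to n*q + sqrt(n)*y, i.e. at y = (ell - n*q) / sqrt(n)."""
    ell = round(n * q + math.sqrt(n) * y)
    log_pmf = (math.lgamma(n + 1) - math.lgamma(ell + 1) - math.lgamma(n - ell + 1)
               + ell * math.log(q) + (n - ell) * math.log(1.0 - q))
    return math.sqrt(n) * math.exp(log_pmf)


def gaussian_density(y, variance):
    """Density of a centered Gaussian with the given variance."""
    return math.exp(-y * y / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


if __name__ == "__main__":
    n, q = 10_000, 0.3
    for y in (-0.5, 0.0, 0.5):
        print(y, rescaled_binomial_density(n, q, y), gaussian_density(y, q * (1 - q)))
```

The factor sqrt(n) accounts for the spacing 1/sqrt(n) between neighbouring values of y.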

X influences Y: Asymptotically, $X_f$ will always be zero. This is intuitively
clear because the energy gap of $S_X$ tends to infinity. More formal arguments
can be constructed in analogy to the derivations below. Therefore the whole
total energy is finally in $S_Y$ and $p(y_f|x_i = 0)$ is Gaussian with mean
zero and variance $q(1 - q)$.

If $S_X$ starts in its upper state instead, the total energy is shifted by
$c_n \sqrt{n}$ and $p(y_f|x_i = 1)$ is therefore Gaussian with mean $c$ and
variance $q(1 - q)$. This shows that $p(y_f|x_i)$ is second order. Since
$p(x_i)$ is trivially second order, we obtain a joint distribution
$p(y_f, x_i)$ in the class $S_{X_i \to Y_f}$. Since $p(y_f)$ is a Gaussian
mixture, the joint distribution cannot be in $S_{Y_f \to X_i}$.

Y influences X: In fact there is no influence because $p(x_f = 0) = 1$. The
joint distribution $p(y_i, x_f)$ is in $S_{Y_i \to X_f}$ and
$S_{X_f \to Y_i}$ because $p(y_i, x_f) = p(y_i) p(x_f)$, where $p(y_i)$ is
Gaussian and $p(x_f)$ is second order anyway.

Infinite temperature: Let $q = 1/2$. Then $S_X$ does not necessarily end up
in its lower level. We first compute $p(x_f|x_i = 0, y_i)$ for the case of
finite $n$. If $S_X$ ends up in its lower state, the initial total energy
$\ell_i$ is distributed among the subsystems of $S_Y$; otherwise we only have
to distribute the energy $\ell_i - m$. The ratio between the numbers of
combinations for both cases provides the ratio between the probabilities of
finding $S_X$ in its upper or lower level after the mixing:

$$\frac{p(x_f = 1|x_i = 0, y_i)}{p(x_f = 0|x_i = 0, y_i)} = \frac{\ell_i (\ell_i - 1) \cdots (\ell_i - m_n + 1)}{(n - \ell_i + 1)(n - \ell_i + 2) \cdots (n - \ell_i + m_n)} . \qquad (23)$$

Taking the logarithm of the right hand side and using
$\ell_i = \sqrt{n}\, y_i + n/2$ yields, after some algebra, a sum of the form
$\sum_{j=0}^{m_n - 1} \ln(1 + W_j)$. Every $W_j$ tends to zero with
$O(1/\sqrt{n})$ and the sum consists of $O(\sqrt{n})$ terms. Due to
$\ln(1 + W_j) = W_j + O(W_j^2)$ we thus have

$$\lim_{n \to \infty} \sum_{j=0}^{m_n - 1} \ln(1 + W_j) = \lim_{n \to \infty} \sum_{j=0}^{m_n - 1} W_j = \lim_{n \to \infty} \sum_{j=0}^{m_n - 1} \left( \frac{4 y_i}{\sqrt{n}} - \frac{4 j}{n} \right) = 4 c y_i - \int_0^c 4x \, dx = 2c(2 y_i - c) .$$

The second equation reduces the expression to the asymptotically relevant
terms and the third step holds because $m_n$ grows with $c \sqrt{n}$.
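The algebra behind the terms $W_j$ can be made explicit as follows. This is only a sketch: pairing the $j$-th factor of the numerator of eq. (23) with the $j$-th factor of its denominator is an assumption made here, and other pairings lead to the same limit.

```latex
\ln \frac{\ell_i (\ell_i - 1) \cdots (\ell_i - m_n + 1)}
         {(n - \ell_i + 1) \cdots (n - \ell_i + m_n)}
  = \sum_{j=0}^{m_n - 1} \ln \frac{\ell_i - j}{n - \ell_i + 1 + j}
  = \sum_{j=0}^{m_n - 1} \ln (1 + W_j),
\qquad
W_j := \frac{2 \sqrt{n}\, y_i - 2j - 1}{n/2 - \sqrt{n}\, y_i + 1 + j}
     = \frac{4 y_i}{\sqrt{n}} - \frac{4 j}{n} + O\!\left(\frac{1}{n}\right)
```

Here $\ell_i = \sqrt{n}\, y_i + n/2$ has been inserted, and the $O(1/n)$ remainder uses $j \le m_n = O(\sqrt{n})$; summed over $O(\sqrt{n})$ terms it vanishes in the limit.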

$$\frac{p(x_f = 1|x_i = 0, y_i)}{p(x_f = 0|x_i = 0, y_i)} = e^{2c(2 y_i - c)} ,$$

i.e.,

$$p(x_f|x_i = 0, y_i) = \frac{1}{2} \left( 1 \pm \tanh[2c(2 y_i - c)] \right) , \qquad (24)$$

where the signs $+, -$ correspond to $x_f = 1, 0$, respectively. For
$x_i = 1$ the initial total energy is $\ell_i + m$ instead of $\ell_i$ and
$y_i$ is thus replaced with $y_i + c$:

$$p(x_f|x_i = 1, y_i) = \frac{1}{2} \left( 1 \pm \tanh[2c(2 y_i + c)] \right) . \qquad (25)$$
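The convergence of the logarithm of the right hand side of eq. (23) to the exponent $2c(2 y_i - c)$ can be checked numerically. The following sketch makes its own assumptions: $m_n$ and $\ell_i$ are rounded to integers, and the values of $c$, $y_i$ and $n$ are chosen only for illustration.

```python
import math


def log_ratio_exact(n, c, y):
    """Logarithm of the right hand side of eq. (23) for finite n.

    Assumption of this sketch: m_n = round(c * sqrt(n)) and
    ell_i = round(sqrt(n) * y + n / 2), i.e. the case q = 1/2.
    """
    m = round(c * math.sqrt(n))
    ell = round(math.sqrt(n) * y + n / 2)
    return sum(math.log((ell - j) / (n - ell + 1 + j)) for j in range(m))


def log_ratio_limit(c, y):
    """Asymptotic value 2c(2y - c) of the log-ratio."""
    return 2.0 * c * (2.0 * y - c)


if __name__ == "__main__":
    c, y = 0.5, 0.3
    for n in (10**2, 10**4, 10**6):
        print(n, log_ratio_exact(n, c, y), log_ratio_limit(c, y))
```

The finite-n values should approach the limit as n grows, with deviations that shrink roughly like $1/\sqrt{n}$.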



Y influences X: For $r = 0, 1$ the conditional is given by (24) or (25),
respectively. Hence $p(x_f, y_i)$ is in $S_{Y_i \to X_f}$ because $p(y_i)$ is
a Gaussian with mean zero and variance $1/4$, i.e.,

$$p(y_i) = \sqrt{\frac{2}{\pi}}\, e^{-2 y_i^2} .$$

To see that $p(x_f, y_i)$ is not in $S_{X_f \to Y_i}$ we recall that only for
the trivial cases the joint distribution is second order in both directions
(see Section 2).

For $r \neq 0, 1$ we obtain a mixture of the conditionals (24) and (25), which
is no longer of second order, and hence $p(x_f, y_i)$ is not in
$S_{Y_i \to X_f}$. To see that it is not in $S_{X_f \to Y_i}$ either, we
observe

$$p(y_i, x_f = 1) = p(x_f = 1|y_i)\, p(y_i) = \frac{1}{2} \left( 1 + r \tanh[2c(2 y_i + c)] + (1 - r) \tanh[2c(2 y_i - c)] \right) \sqrt{\frac{2}{\pi}}\, e^{-2 y_i^2} ,$$

which is not proportional to a Gaussian as it should be.

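X influences Y: To compute $p(y_f|x_i)$ we observe

$$p(y_f|x_i) = \sum_{x_f,\, y_i} p(y_f, x_f|x_i, y_i)\, p(y_i) ,$$

because $X_i$ and $Y_i$ are independent. Since $y_i = y_f + c(x_f - x_i)$ we
obtain

$$p(y_f|x_i) = \sum_{x_f} p\big(x_f \,\big|\, x_i,\, y_i = y_f + c(x_f - x_i)\big)\, p\big(y_i = y_f + c(x_f - x_i)\big)$$

$$= \frac{1}{2} \left( 1 + \tanh[2c(2 y_f - 2c x_i + c)] \right) \sqrt{\frac{2}{\pi}}\, e^{-2 (y_f + c(1 - x_i))^2} + \frac{1}{2} \left( 1 - \tanh[2c(2 y_f - 2c x_i - c)] \right) \sqrt{\frac{2}{\pi}}\, e^{-2 (y_f - c x_i)^2} .$$

Plotting $p(y_f|x_i)$ for fixed $x_i$ shows that it is not Gaussian unless
$c = 0$. Likewise, $p(y_f)$ is not Gaussian. Hence $p(y_f, x_i)$ is neither in
$S_{X_i \to Y_f}$ nor in $S_{Y_f \to X_i}$.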

Spin systems are actually quantum systems. In order to further support
the general idea, we sketch a corresponding quantum scenario.

Let $S_X$ and $S_Y$ be described by the Hilbert spaces
$\mathcal{H}_X := \mathbb{C}^2$ and $\mathcal{H}_Y := (\mathbb{C}^2)^{\otimes n}$,
respectively, and let $S_X$ start in its lower level $|0\rangle$. We assume
that $S_Y$ starts in an eigenstate in $\mathcal{H}_Y$ of the total energy with
eigenvalue $\ell$.

Now we discuss what a typical energy-conserving unitary map on
$\mathcal{H}_X \otimes \mathcal{H}_Y$ does. The space of states with total
energy $\ell$ splits up into the space

$$\mathcal{H}_0 := |0\rangle \otimes \mathcal{G}_\ell ,$$

where $\mathcal{G}_\ell$ consists of all states in $\mathcal{H}_Y$ having
total spin $\ell$, and

$$\mathcal{H}_1 := |1\rangle \otimes \mathcal{G}_{\ell - m} .$$

Obviously, the quotient of the dimensions coincides with the quotient of the
numbers of combinations given by the right hand side of eq. (23).

After the unitary process has been applied, we have a state of the form

$$|0\rangle \otimes |\psi_0\rangle + |1\rangle \otimes |\psi_1\rangle ,$$

with $|\psi_0\rangle \in \mathcal{G}_\ell$ and
$|\psi_1\rangle \in \mathcal{G}_{\ell - m}$. The probability that $S_X$ is
found in its upper level is then given by $\| |\psi_1\rangle \|^2$. The
following lemma shows that in high dimensions almost every state in
$\mathcal{H}_0 \oplus \mathcal{H}_1$ has the property that
$\| |\psi_1\rangle \|^2 / \| |\psi_0\rangle \|^2$ is close to the quotient of
the dimensions of $\mathcal{H}_1$ and $\mathcal{H}_0$.

Lemma 2 Let $\mathcal{H}_1$ and $\mathcal{H}_2$ be two Hilbert spaces of
dimension $d_1$ and $d_2$, respectively. Let
$|\psi\rangle := |\psi_1\rangle \oplus |\psi_2\rangle$ be a randomly chosen
vector according to the Haar measure of $SU(d_1 + d_2)$. Then, for every
$\eta > 0$, the probability that
$\left| \, \| |\psi_1\rangle \|^2 / \| |\psi_2\rangle \|^2 - d_1/d_2 \, \right| > \eta$
tends to zero for $d_1 + d_2 \to \infty$.

Sketch of the proof: Define a function $f$ on
$\mathcal{H} := \mathcal{H}_1 \oplus \mathcal{H}_2$ by

$$f(\psi) := \langle \psi | P_1 | \psi \rangle ,$$

where $P_1$ is the projector onto $\mathcal{H}_1$. The function $f$ is
Lipschitz continuous with constant $L = 2$. It is easy to show that the
average of $f$ over $SU(d_1 + d_2)$ is $d_1/(d_1 + d_2)$, which corresponds
to the quotient $d_1/d_2$ of the two squared norms. Otherwise the Haar
measure would not be invariant with respect to permutations of basis vectors.
Then the lemma follows from Lévy's Lemma [34]: given a Lipschitz continuous
function $f$ on a unit sphere, the volume of the region where $f$ is not
close to its average can be bounded from above in terms of $L$. For growing
dimension the bound tends to zero. □
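The concentration stated in Lemma 2 can be illustrated by sampling. The sketch below relies on the standard fact that a normalized vector of independent complex Gaussian entries is uniformly distributed on the unit sphere; since the normalization cancels in the ratio of the two squared norms, it is omitted. The function names and the chosen dimensions are assumptions of this illustration.

```python
import random


def squared_norm(dim, rng):
    """Squared norm of a vector of `dim` independent complex Gaussian entries."""
    return sum(rng.gauss(0.0, 1.0) ** 2 + rng.gauss(0.0, 1.0) ** 2 for _ in range(dim))


def norm_ratio(d1, d2, rng):
    """|| psi_1 ||^2 / || psi_2 ||^2 for a random vector psi = psi_1 (+) psi_2,
    where psi_1 has d1 and psi_2 has d2 components."""
    return squared_norm(d1, rng) / squared_norm(d2, rng)


if __name__ == "__main__":
    rng = random.Random(1)
    for scale in (1, 10, 100, 1000):
        d1, d2 = 2 * scale, 3 * scale
        print(d1 + d2, norm_ratio(d1, d2, rng), d1 / d2)
```

With growing total dimension the sampled ratio should scatter less and less around $d_1/d_2$.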

This shows that the probability of finally finding $S_X$ in its upper level
depends on $\ell$ in the same way as in the classical scenario. Hence we
reproduce the second order Markov kernel in eq. (24).

However, the difference is that the outcome of measurements on $S_X$ is
probabilistic even when one specific pure initial state and one typical
unitary are considered (without requiring any further randomness).

6 Relation to non-equilibrium thermodynamics

Explaining the postulated asymmetries via mixing processes, as we have
done repeatedly, suggests considering the topics discussed in this paper as
part of non-equilibrium thermodynamics. However, a priori, it is not clear
whether there could also be statistical asymmetries between cause and effect
in thermal equilibrium.

We first consider the difference between past and future in stochastic
processes $(X_t)_{t \in \mathbb{Z}}$, where the answer is negative. If the
factorization

$$P(X_t, X_{t-1}) = P(X_{t-1})\, P(X_t|X_{t-1})$$

leads to simpler terms than the factorization into
$P(X_t)\, P(X_{t-1}|X_t)$ in any sense, then we must have a violation of the
symmetry

$$P(X_t = r, X_{t-1} = s) = P(X_t = s, X_{t-1} = r) . \qquad (26)$$

If the (possibly vector-valued) variable $X_t$ describes the state of a
physical system in phase space at time $t$, eq. (26) is just another
formulation of the so-called detailed balance condition, which is known to
hold in Gibbs equilibrium [35], but not in non-equilibrium steady states
[36]. This shows that statistical asymmetries between past and future require
non-equilibrium states. In the literature, such asymmetries have been
discussed for various types of non-equilibrium steady states, e.g. [37, 38],
as well as their relation to thermodynamic irreversibility.

To explore the importance of non-equilibrium for the models discussed in
this paper, we first consider an extremely simplified quantum model of the
dynamics in the Stern-Gerlach experiment. Define the Hilbert space

$$\mathcal{H} := L^2(\mathbb{R}) \otimes \mathbb{C}^2 ,$$

where the set of square integrable functions encodes the momentum degree of
freedom in transversal direction and the two-dimensional component represents
the spin. We assume, for simplicity, that the only Hamiltonian that is
relevant inside the furnace is the Hamiltonian $H$ of the harmonic oscillator
corresponding to the confining potential. Hence, the joint Hamiltonian of
spin and momentum is given by $H \otimes \mathbf{1}$. In thermal equilibrium
we have (up to normalization) the state

$$\rho \otimes \mathbf{1} \quad \text{with} \quad \rho := \sum_n q_n |n\rangle\langle n| ,$$

where $|n\rangle$ with $n = 1, 2, \dots$ denotes the eigenstates of the
oscillator and $q_n$ the Boltzmann probabilities corresponding to the
considered temperature.

After the atoms leave the furnace, the oscillator potential is no longer
effective and the system is no longer in equilibrium. The inhomogeneous
field generates a dynamics that entangles spin and translational degree of
freedom. We assume that the position degrees of freedom corresponding
to other directions than the transversal direction under consideration are
irrelevant and we have free motion in longitudinal direction.

When the atom arrives at the screen, its state has been transformed to
$U(\rho \otimes \mathbf{1})U^\dagger$ with the unitary map

$$U := U_\downarrow \otimes |\downarrow\rangle\langle\downarrow| + U_\uparrow \otimes |\uparrow\rangle\langle\uparrow| ,$$

where $|\downarrow\rangle, |\uparrow\rangle$ denote the two possible spin
states and $U_\downarrow, U_\uparrow$ are unitary operators that act on the
transversal degree of freedom in a spin-dependent way. When the atoms leave
the furnace and enter the field, a unitary dynamics transforms the state,
i.e., the system is no longer in equilibrium. The relation between cause and
effect in this example is therefore generated in a non-equilibrium dynamics,
i.e., by removing constraints like the oscillator potential.

To see how the form of the relevant quantum states is related to the
non-equilibrium dynamics, we add the following observations. The states

$$U_\uparrow \rho\, U_\uparrow^\dagger \quad \text{and} \quad U_\downarrow \rho\, U_\downarrow^\dagger$$

are Gibbs equilibrium states for the transformed Hamiltonians

$$U_\uparrow H U_\uparrow^\dagger \quad \text{and} \quad U_\downarrow H U_\downarrow^\dagger .$$

If we assume that $U_\uparrow$ and $U_\downarrow$ are simple dynamical
evolutions like translations, these are, again, simple Hamiltonians. Hence
the conditional state of the system representing the effect, given a fixed
value of the cause variable, is a Gibbs state for a simple Hamiltonian.
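The reason why a unitarily transformed Gibbs state is again a Gibbs state is that conjugation with a unitary commutes with the operator exponential and leaves the trace invariant. As a short sketch, assuming $\rho = e^{-\beta H} / \mathrm{tr}(e^{-\beta H})$ with inverse temperature $\beta$ (a symbol introduced here only for illustration):

```latex
U_\uparrow \, \rho \, U_\uparrow^\dagger
  = \frac{U_\uparrow \, e^{-\beta H} \, U_\uparrow^\dagger}{\mathrm{tr}\left(e^{-\beta H}\right)}
  = \frac{e^{-\beta \, U_\uparrow H U_\uparrow^\dagger}}
         {\mathrm{tr}\left(e^{-\beta \, U_\uparrow H U_\uparrow^\dagger}\right)}
```

The same computation applies to $U_\downarrow$.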

On the other hand, the marginal state of the effect system itself is given
by $(U_\uparrow \rho\, U_\uparrow^\dagger + U_\downarrow \rho\, U_\downarrow^\dagger)/2$.
The formal Hamiltonian that can be obtained from the logarithm of such a
mixture does not have any direct physical meaning (see footnote 7) and need
not be simple.

7 Jaynes stated in the context of non-equilibrium thermodynamics [39]: "[...] we must
learn how to construct ensembles which describe not only the present values of macroscopic
quantities, but also whatever information we have about their past behavior."

If cause and effect are represented by the states of two physical systems (at
the same time instant), one influencing the other with negligible back action,
we are faced with the question whether such a kind of causal unidirectionality
already requires thermal non-equilibrium. To discuss this, we revisit the
setting of Subsection 3.4 but with $T_1 = T_2 = T$. Then the joint
distribution of the systems reads

$$p(x, y) = \frac{e^{-\beta H(x, y)}}{\sum_{x', y'} e^{-\beta H(x', y')}} .$$

Recall now the sender/receiver protocol where system 1 was randomly adjusted
to some value $x$ according to the marginal distribution $p(x)$. The
conditional $p_{\rightarrow}(y|x)$ will then coincide with the usual
equilibrium conditional $p(y|x)$. Hence the intervention preserves the usual
equilibrium state. For symmetry reasons, this clearly holds for adjusting
system 2, too. But then the backward and the forward information coincide
exactly, i.e.,

$$I_{\rightarrow}(X : Y) = I_{\leftarrow}(X : Y) .$$

Hence, different temperatures were really needed in Subsection 3.4 to obtain
a definite causal direction.

We want to revisit the second example in Section 2 (with the spins in a 
magnetic field) in light of this result. The interaction between the field and 
the spin cannot be an interaction between two systems in Gibbs equilibrium 
with a common temperature, otherwise the field would be influenced by the 
probe spin in the same way as vice versa, in contradiction to our assumption 
on the definite causal direction. To show this, we assume that the field is 
generated by n spin 1/2 particles. Let 



S z := 



3=1 



be their total spin in z direction, where ai^ denotes the Pauli matrix a z 
on spin j. The free Hamiltonian of the n-spin system when subjected to a 
magnetic field B in z direction is given by 

H := BS Z <g> 1 . 

The free Hamiltonian of the probe spin system is B(l <g) a z ). 

In its thermal equilibrium, the total spin $S_z$ follows a binomial
distribution $B_q(k)$ with $q/(1 - q) = \exp(-1/kT)$. For large $n$, the total
magnetic moment fluctuates on the scale $\sqrt{n}$. Then we introduce an
interaction $H_I$ by

$$H_I := \frac{c}{\sqrt{n}}\, S_z \otimes \sigma_z ,$$

with a constant $c$ determining the interaction strength. The scaling factor
$1/\sqrt{n}$ is chosen such that the total field strength "felt" by the probe
spin system follows a well-defined distribution in the limit $n \to \infty$.
Now we consider the conditional probability for $k$ spins up given that the
probe spin is in its upper state. The total spin of the $n$-particle system
defines an integer-valued random variable $Y$. We have

$$p(y = k|x = 1/2) = B_{q_+}(k)$$

and

$$p(y = k|x = -1/2) = B_{q_-}(k) ,$$

where $q_\pm$ are given by
$q_\pm/(1 - q_\pm) = \exp((1 \pm c/\sqrt{n})/kT)$. In the limit of large $n$
the binomial distributions can be replaced with two Gaussians with standard
deviation of the order of $\sqrt{n}$. Their mean values also differ on the
scale $\sqrt{n}$. This shows that we obtain a mixture of two Gaussians for the
distribution of magnetic moments of the $n$-particle system. Once we adjust
the probe spin, bimodality disappears. This shows that we have mutual
influence between the probe and the system generating the field.

We conclude that every example discussed in this paper relies on non- 
equilibrium states. 

7 Conclusions 

We have described several physical settings where the conditional probability
for an effect given its cause is less complex than the probability for the
cause given its effect. Here we have considered different notions of
complexity, e.g., complexity with respect to a hierarchy of exponential
families as well as computational complexity.

To link this kind of "asymmetric Occam's Razor principles" with the ther- 
modynamic arrow of time we have constructed models where the statistical 
asymmetries between cause and effect are implications of the irreversibility 
of mixing processes. The common root between all the known asymmetries 
is therefore the tendency of specific initial conditions to evolve into typical 
final states. Specific initial conditions can, for instance, be product
probability distributions of joint systems that evolve typically to
distributions with stochastic dependences. The fact that specific initial
conditions occur more often than specific final conditions is linked with the
second law, or may even be considered as its essential content.

However, appropriate notions of simplicity have yet to be discovered.
Since it is impossible to draw reliable causal conclusions from statistical
observations that do not involve interventions, we have to restrict ourselves
to finding causal inference rules which are often valid. These have to be
based upon observing which transition probabilities P(effect | cause) are
likely to occur in nature and which ones are likely to correspond to
non-causal conditionals. To explore this asymmetry in a systematic way as
well as its relation to the thermodynamics of irreversible processes is an
important challenge for both machine learning and theoretical physics.